Fast Coding Unit Encoding Mechanism for Low Complexity Video Coding

In high efficiency video coding (HEVC), the coding tree structure contributes to excellent compression performance, but it also brings extremely high computational complexity. This paper presents work on improving the coding tree to further reduce encoding time. A novel low complexity coding tree mechanism is proposed for fast HEVC coding unit (CU) encoding. First, this paper makes an in-depth study of the relationship among CU distribution, quantization parameter (QP), and content change (CC). Second, a CU coding tree probability model is proposed for modeling and predicting the CU distribution. Finally, a CU coding tree probability update is proposed to address the probabilistic model distortion caused by CC. Experimental results show that the proposed low complexity CU coding tree mechanism significantly reduces encoding time: by 27% for lossy coding and by 42% for visually lossless and lossless coding. The proposed mechanism thus improves coding performance under various application conditions.


Introduction
Born after a decade of preparation, HEVC [1] was initiated by the Joint Collaborative Team on Video Coding (JCT-VC), jointly established by the International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) Moving Picture Experts Group (MPEG) and the International Telecommunication Union-Telecommunication (ITU-T) Video Coding Experts Group (VCEG). Due to the rapid growth of multimedia services, deploying HEVC in applications while simultaneously providing high quality images with minimum transmission delay over rate-limited networks remains a challenge [2][3]. HEVC is expected to have a growing impact on high resolution video (4K and 8K) and high fidelity video for high resolution displays such as high definition TV (HD-TV) and ultrahigh definition TV (UHD-TV) [4][5]. HEVC not only inherits the crucial elements of H.264/AVC, but also adopts numerous new techniques [6][7] to achieve considerable performance. Among these advanced techniques, the coding tree structure is one of the most powerful tools for improving the coding efficiency of HEVC. However, the coding tree is a complicated technique that improves compression performance at the cost of a huge increase in computational complexity. Reducing computational complexity with negligible performance loss has therefore been a major research concern, because videos may be played on any kind of device, such as a PC or mobile phone, with different computational abilities and system resources.
Recently, a large amount of effort has been devoted to reducing the computational complexity of CU coding tree traversal in HEVC. Studies on fast CU encoding exploit the redundant information that often exists in CU coding tree traversal; such redundancy occurs in both the spatial and the temporal sense, so alternative mechanisms for intra mode decision and inter mode decision are of interest. Many works have focused on fast coding tree computing via early CU size decision for HEVC intra encoders [8][9][10] and inter encoders [11][12]. Shen [8] skips intra prediction modes that are rarely used in the parent CUs at the upper depth levels or in spatially nearby CUs. Ahn [12] simplified the RD competition process by selectively conducting mode decision according to the inter prediction unit modes: split-type, square-type, or nonsquare-type. These approaches effectively reduce encoding time, mostly by using fast mode decision at the prediction unit (PU) level and transform unit (TU) level. However, they still need to traverse all CU sizes to obtain the optimal CUs, which is the main reason for their limited performance and leaves room for further improvement. In HEVC, PU and TU partitions are used to determine whether a CU should be split or not. It can be inferred that CU encoding time will be significantly reduced if the encoder can directly determine how to tailor the CU coding tree. Accordingly, mechanisms for low complexity CU encoding that combine early determination with other coding parameters have been successfully applied. Several works have focused on speeding up CU coding tree computing through early termination [13][14][15][16][17]. Cen [14] proposed a fast CU depth decision mechanism that utilizes spatial correlations to determine the CU depth range.
Song [15] presented an early merge mode decision method to avoid exhaustive mode checks for CUs derived from recursive quad-tree partitioning. Prominent studies [18][19] show that more low complexity schemes continue to focus on CU coding tree structure prediction. Guo [19] proposed a fast CU size selection algorithm based on hierarchical quad-tree correlations, in which the size of the current CU is determined according to the subtree distributions of adjacent CUs. In general, most current research aims to reduce coding tree complexity by searching a specific tree depth instead of traversing all depths. In other words, current approaches only reduce the leaf nodes of the coding tree by limiting depth, which means that the CU coding tree is still computed from its maximum CU size. However, with varying video content and coding parameters, the best CU coding tree structure may have a variable maximum CU size and be searched over a scalable depth. A CU coding tree structure that is overly complex contributes little to coding performance but wastes a large amount of encoding time. In general, using a befitting CU coding tree structure is the key to reducing computational complexity while maintaining good coding performance. In order to further improve CU coding tree prediction with thorough tailoring, a new low complexity CU coding tree mechanism is designed in this work. The proposed method conforms to the optimal CU partition according to the quantization parameter (QP) and content change (CC).

HEVC Coding Tree Structure
The hierarchical coding structure of HEVC is based on the coding tree structure of the CU, as shown in Fig 1. The coding tree unit (CTU) is defined as a root node with a CU size of 64×64. The coding tree structure allows recursive splitting into four equally sized nodes, starting from the CTU and stopping when the tree depth reaches the maximum. The maximum tree depth is defined as the maximum number of splits in the CU coding tree, where one split divides a CU into four sub-CUs. Thus, a CTU size of 64×64 and a maximum tree depth of four imply that the smallest leaf node CU size is 8×8. Coding tree partitioning therefore allows a content adaptive coding tree structure comprising CU64 (a block containing 64×64 pixels), CU32 (32×32 pixels), CU16 (16×16 pixels), and CU08 (8×8 pixels).
Each CU node is configured to use a particular prediction mode, which may be either intra prediction or inter prediction. Mode decision is the process by which the encoder selects the optimal CU size for the current CTU by calculating the coding cost J_mode of the various CU modes. The cost function for mode decision is specified by

J_mode = SSE + λ_mode · B_mode (1)

where B_mode specifies the bit cost to be considered for mode decision, SSE represents the sum of squared errors between two blocks of the same block size, and λ_mode is the Lagrange multiplier used for cost computation. For each CU, the following mode decision process is conducted in the HEVC encoder. J_mode is computed for each mode (MODE_INTER, MODE_SKIP, MODE_INTRA, PCM) and the minimum CU coding cost is retained. A check is then performed to determine whether the best coding mode is MODE_SKIP (the early CU condition). The bit cost B_mode is updated by adding the bits for the CU split flag and the minimum J_mode is recomputed. If the early CU condition is true, the encoder does not proceed to the recursive mode decision at a smaller CU size; otherwise, it proceeds to the recursive mode decision at a smaller CU size, provided the current CU depth is not at the maximum. This CU level mode decision is performed recursively for each CU depth. The final optimal CU size is determined by deciding whether the current CU should be split or not, based on the minimum between the J_mode of the current CU size and the sum of the J_mode of the four smaller CUs. A CTU is thus divided into multiple CUs, as shown in Fig 2. The flexible coding tree structure of HEVC contributes a significant improvement in coding gain. However, it causes a dramatic increase in encoding complexity, because the HEVC encoding process needs to explore every single CU size from 64×64 to 8×8, and the best PU and TU partition must be determined for all possible PU and TU sizes. These exhaustive mode checks result in an enormous increase in computational complexity.
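The recursive mode decision described above can be sketched as follows. This is a simplified illustration, not the HM implementation: the cost oracle `mode_cost(x, y, size, mode)`, assumed to return J_mode = SSE + λ·B_mode for one mode, is a hypothetical stand-in supplied by the caller.

```python
# Minimal sketch of the recursive CU mode decision (not the HM implementation).
# `mode_cost(x, y, size, mode)` is an assumed caller-supplied cost oracle.

MODES = ("MODE_INTER", "MODE_SKIP", "MODE_INTRA", "PCM")

def decide_cu(x, y, size, depth, max_depth, mode_cost):
    """Return (best_cost, partition) for the CU at (x, y) with side `size`."""
    # Cost of coding this CU without splitting: best over all modes.
    best_mode, no_split = min(
        ((m, mode_cost(x, y, size, m)) for m in MODES), key=lambda t: t[1]
    )
    # Early CU condition: stop recursing when MODE_SKIP wins, or at max depth.
    if best_mode == "MODE_SKIP" or depth == max_depth:
        return no_split, [(x, y, size)]
    # Otherwise compare against the summed cost of the four sub-CUs.
    half = size // 2
    split_cost, split_parts = 0.0, []
    for dx, dy in ((0, 0), (half, 0), (0, half), (half, half)):
        c, p = decide_cu(x + dx, y + dy, half, depth + 1, max_depth, mode_cost)
        split_cost += c
        split_parts += p
    if split_cost < no_split:
        return split_cost, split_parts
    return no_split, [(x, y, size)]
```

With a toy cost oracle that strongly favors small blocks, a 64×64 CTU at maximum depth 4 is split all the way down to 8×8 leaves, mirroring the exhaustive traversal described in the text.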

Motivations and Analyses
The goal of reducing encoding complexity can be realized by tailoring the complete CU coding tree to avoid exhaustive CU size checks. This section presents the motivation for CU coding tree tailoring and the analyses of the CU distribution, respectively.

Motivations for CU Coding Tree Tailor
The CU coding tree structure affects encoding complexity due to the exhaustive traversal of its nodes in search of the optimal CU size. A rough way to characterize the relationship between CU coding tree structure complexity and encoding time saving is to perform a complete CU coding tree tailoring. As stated above, a CU coding tree is defined by a root node and a tree depth; thus, our experiment compares various tailored CU coding tree structures with the complete HEVC CU coding tree structure with a 64×64 CTU and depth = 4. CU coding tree complexity is decided by both the depth and the CTU size: the smaller the depth and CTU, the lower the CU coding tree complexity. Fig 3 shows the CU coding tree complexity for various depths with the same CTU. It can be inferred that CU coding tree complexity drops proportionally with decreasing depth. The realistic results come from the experimental statistics in Table 1, while the theoretical result indicates that the CU coding tree complexity drops to about 25% when the depth drops by 1. The realistic reduction in CU coding tree complexity is close to, but not as large as, the theoretical result, because the theoretical result does not consider fast CU mode determination (the early CU condition). In general, CU coding tree complexity decreases as the depth is reduced.
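As a rough illustration of how depth drives traversal cost, the CU nodes visited by an exhaustive search over a complete quad-tree can simply be counted per depth. This count is a sketch only: it ignores the early CU condition and the PU/TU work per node, so it approximates the theoretical trend rather than the measured numbers in Table 1.

```python
def cu_nodes(max_depth):
    """Number of CU nodes in a complete quad-tree of the given depth
    (depth 1 = CTU only): 1 + 4 + 16 + ... = sum of 4^i."""
    return sum(4 ** i for i in range(max_depth))

# Exhaustive HEVC search: 64x64 CTU, depth 4 -> 1 + 4 + 16 + 64 = 85 nodes.
full = cu_nodes(4)
for d in range(1, 5):
    print(f"depth {d}: {cu_nodes(d):3d} nodes ({cu_nodes(d) / full:.1%} of full tree)")
```

Under this count, dropping the depth from 4 to 3 leaves 21 of 85 nodes (about 25%), which matches the theoretical trend stated above.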
In addition to depth, CTU size is also key to reducing CU coding tree complexity. As illustrated in Table 1, an encoding time saving of up to 72.52% can be achieved when the CU coding tree is a single node and all CU sizes are 64×64; that is, when the CU coding tree has a root node size of 64×64 and a depth of 1. Table 1 not only shows that a simpler CU coding tree structure helps reduce encoding complexity, but also shows the trend that the larger the chosen CTU size and the shallower the chosen depth, the more time can be saved. However, attention should be given to an unavoidable problem: encoding accuracy severely drops when the CU coding tree structure is too simple. Thus, if the encoder is to save encoding time while maintaining encoding accuracy, the CU coding tree tailoring scheme should be built on a criterion for deciding whether CU sizes with low probability can be tailored away. In the next section, we analyze the factors affecting the CU distribution.

Analyses of CU Distribution
In order to investigate the general CU distribution law, statistics were obtained under the widest QP range, from 0 to 47. The results show a high correlation between QP and the optimal CU distribution. It can be observed that CU sizes in a specific range occupy most of the CU distribution within the same sequence. In general, small QPs are highly likely to be encoded with CUs of small sizes; on the contrary, large QPs are highly likely to be encoded with CUs of larger sizes. In particular, the encoding condition can be divided into three modes by QP range: lossless mode, visually lossless mode, and lossy mode. For the lossless and visually lossless modes, a CU size of 64×64 has quite a low probability, while small CU sizes have a high probability; CU sizes of 16×16 and 8×8 together occupy 92.70% of the CU distribution.

The observation that the CU distribution can be estimated is used to accelerate the CU encoding process. In detail, it is unnecessary to evaluate all CU sizes from CU64 to CU08 when determining the optimal CU size. The encoder can traverse only the specific CU size range that occupies most of the CU distribution probability; CU sizes outside this range can be skipped, because they are rarely used in the same sequence. Fig 4 shows that different QPs have a serious impact on the CU distribution. However, the HEVC encoder has a more complex QP setting mechanism: HEVC inter coding is performed with different QPs according to the temporal layer. For different coding configurations, the QP of each inter coded picture is derived by adding an offset to the QP of the intra coded picture (QPI) depending on the temporal layer. For example, if the base QP is set to QPI for the first layer, the QP of the second layer is QPL2 = QPI + 2, the QP of the third layer is QPL3 = QPI + 3, and so on. This configuration leads to a nonuniform J_mode across pictures by changing the λ_mode defined in Eq 1.
The λ_mode is the Lagrange multiplier used for cost computation, given by

λ_mode = α · W_k · 2^((QP − 12)/3) (2)

where W_k represents the weighting factor that depends on the QP offset hierarchy level of the current picture within a group of pictures (GOP). This means that although a base QP is set before encoding, each picture has its own QP, which depends on the QP offset and results in a different J_mode for each coded picture, not to mention different CU distributions across coded pictures. For this reason, it is important to investigate the CU distribution law of each frame based on the GOP structure. Table 2 lists the QP and QP offsets used in the low delay and random access experiments. Table 3 and Table 4 show the CU distribution probability for different GOP sizes under QP = 32. It can be observed that there is a relationship between the CU distribution and the GOP structure. Fig 5 shows the CU distribution probability of the BQSquare sequence.
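The hierarchical QP and λ derivation above can be sketched as follows. The QP offsets per temporal layer follow the pattern described in the text, while the weighting factors W_k below are illustrative placeholders, not the exact HM configuration values.

```python
# Sketch of hierarchical QP and lambda derivation (illustrative values only).
QP_OFFSET = {1: 0, 2: 2, 3: 3}      # temporal layer -> offset added to QPI
W_K = {1: 0.57, 2: 0.68, 3: 0.85}   # assumed per-layer weighting factors

def picture_lambda(base_qp, layer):
    """Return (QP, lambda_mode) for a picture at the given temporal layer:
    QP = QPI + offset, lambda_mode = W_k * 2^((QP - 12) / 3)."""
    qp = base_qp + QP_OFFSET[layer]
    return qp, W_K[layer] * 2 ** ((qp - 12) / 3.0)

for layer in (1, 2, 3):
    qp, lam = picture_lambda(32, layer)
    print(f"layer {layer}: QP={qp}, lambda={lam:.2f}")
```

Even with a single base QP of 32, each layer ends up with its own QP and hence its own λ_mode, which is why J_mode (and so the CU distribution) varies from picture to picture within a GOP.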
A CU size of 32×32 has the highest probability in the first three pictures of a GOP. For the fourth picture, however, the probability of a 32×32 CU drops severely, while the probabilities of the other CU sizes increase markedly. It is noteworthy that almost all GOPs follow this rule. The reason is that corresponding pictures have the same fixed QP offset in different GOPs, and the picture order count (POC) pattern is also fixed across GOPs. Hence, adjacent GOPs have similar CU distributions, because the same POC position has the same QP offset. Therefore, our first motivation for the low complexity CU coding tree mechanism is that the N th frame and the (N + GOP size) th frame have similar CU distributions. Another situation that disturbs the CU distribution is CC. The CC of a natural sequence usually introduces new objects that make the current and future pictures different from the previous picture; in other words, the steady CU distribution rule related to QP is broken. For this reason, statistics for CC were also obtained. Fig 6 shows the CU distribution probability of an encoded SlideShow sequence without QP offset. These 12 frames represent a typical CC situation. From the 5 th frame to the 11 th frame, the content changes dynamically as numerous texts and color blocks are added. Focusing on CU sizes 64×64 and 8×8, the probability of CU size 64×64 drops from 63.75% to 22.50%, while the probability of CU size 8×8 increases from 16.59% to 56.28%. Therefore, excluding the effect of QP changes, the CU distribution changed due to CC. Thus, our second motivation for the low complexity CU coding tree mechanism is that the CU distribution needs to be updated, otherwise it may become distorted due to CC.

Proposed Low Complexity CU Coding Tree Mechanism
The first motivation reveals that the N th frame and the (N + GOP size) th frame have similar CU distributions, which means that there is explicit CU distribution redundancy in adjacent GOPs. The proposed method utilizes this redundancy to achieve optimal CU coding tree tailoring by establishing a probabilistic model that computes the CU distribution in a GOP. Hence, the key technique is to use the N th CU coding tree to predict the (N + GOP size) th CU coding tree. However, according to the second motivation, the probabilistic model is not invariant under CC. Updating the probabilistic model during video coding is necessary not only to maintain the accuracy of the probabilistic model, but also to avoid error propagation to later GOPs. Based on this, the proposed low complexity CU coding tree mechanism addresses how the probabilistic model is established and how a new CU coding tree is built from it. Moreover, the mechanism addresses how the probabilistic model is recomputed and how often an update is performed.

Establish CU Coding Tree Model
First, a probabilistic model is established by computing the CU distribution in a coded GOP. A function F_esta. is defined for establishing the CU coding tree probability model (CTPM):

F_esta.(CU_i, Frame_j) = P_ij, i = 1, …, m, j = 1, …, n

where CU_i ranges over all CU sizes from 1 to m and Frame_j over the frames of the GOP. P_ij is a two-dimensional matrix, in which each element is the conditional probability of a certain CU size, measured as the frequency of that CU size over all CBs (blocks containing 4×4 pixels) in a frame. Next, a sign function F is calculated for CU coding tree tailoring as follows:

δ_ij = 1 if P_ij ≥ σ, and δ_ij = 0 otherwise

Note that δ_ij is a two-dimensional matrix of the same size as F_esta., and its elements contain only {0, 1}; σ is used to decide the lower limit of P_ij. F is the decision matrix obtained from the coded GOP and used to predict the probabilities of frames in later GOPs. Second, before the encoder uses a complete coding tree to obtain P(CU_m, Frame_n) in a predicted GOP by traversing all CU sizes, a probabilistic model is established in advance by using F to determine whether the probabilities of some CU sizes are zero. With a form similar to F_esta., a function F_pred. is designed for CTPM prediction:

F_pred.(CU_i, Frame_j) = P′_ij

where P′_ij takes values in {0, P_0}, meaning that the probability is either zero or nonzero in the predicted frames. Ultimately, F_pred. is used to establish a tailored CU coding tree for each frame in the predicted GOP. In order to build the new CU coding tree, the size of the root node and the depth must be provided. The new coding tree is therefore calculated from the CU_i present in F_pred. for each frame: the root node size is MaxSize(CU_i | Frame_j), the leaf node size is MinSize(CU_i | Frame_j), and the depth is log2 MaxSize(CU_i | Frame_j) − log2 MinSize(CU_i | Frame_j) + 1.
The size of the new CU coding tree is less than or equal to that of the complete CU coding tree, which means the encoder can obtain an approximately optimal CU partition result without searching all CU sizes.
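The model establishment and tree tailoring described above can be sketched as follows, under stated assumptions: `cu_maps` is a hypothetical per-frame list of chosen CU sizes, probabilities are measured over 4×4 CBs by weighting each CU by its area, and σ is the lower-limit threshold from the sign function.

```python
import numpy as np

CU_SIZES = (64, 32, 16, 8)  # CU_1 .. CU_m with m = 4

def establish_model(cu_maps, sigma=0.05):
    """Build F_esta (per-frame CU size probabilities P_ij) from a coded GOP
    and threshold it with sigma to get the sign matrix delta_ij in {0, 1}.
    `cu_maps`: one list of chosen CU sizes per frame (hypothetical input)."""
    p = np.zeros((len(CU_SIZES), len(cu_maps)))
    for j, sizes in enumerate(cu_maps):
        # Probability is measured over 4x4 CBs, so weight each CU by area.
        areas = np.array([sum((s // 4) ** 2 for s in sizes if s == cu)
                          for cu in CU_SIZES], dtype=float)
        p[:, j] = areas / areas.sum()
    delta = (p >= sigma).astype(int)
    return p, delta

def tailor_tree(delta_col):
    """Derive (root size, leaf size, depth) for one predicted frame from a
    column of the sign matrix (1 = CU size kept, 0 = tailored away)."""
    kept = [s for s, d in zip(CU_SIZES, delta_col) if d]
    root, leaf = max(kept), min(kept)
    depth = int(np.log2(root) - np.log2(leaf)) + 1
    return root, leaf, depth
```

For example, if only 64×64 and 32×32 survive thresholding in a frame, the tailored tree has root 64, leaf 32, and depth 2 instead of the full root 64 with depth 4, so CU sizes 16×16 and 8×8 are never traversed for that frame.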

Update CU Coding Tree Model
In later GOPs, the encoder uses the recomputed F^n_esta. instead of the previous F^(n−1)_esta.. However, the key to guaranteeing coding efficiency is how often to perform the CU coding tree probability update (CTPU). Frequent updating introduces unnecessary computational complexity, while infrequent updating may make the CTPM inaccurate. Our method proposes two rules to decide whether F^n_esta. should be updated.
First, to balance coding efficiency, the CTPU is performed based on the frame rate parameter of each sequence. In general, the interval between CCs is usually longer than one second. The frame rate parameter determines the number of frames per second, and hence the number of GOPs per second; the CTPU period is therefore set so that roughly one update is performed per second of video. Second, although the CTPU period fixes which frame is to be updated, F^n_esta. should not replace F^(n−1)_esta. when the two have similar CU distribution probabilities. Whether to perform the CTPU then depends on the matrix rank, which is used as a measure of matrix equivalence between F^n_esta. and F^(n−1)_esta.. Hence the CTPU is defined as

F^n_esta. = F^(n−1)_esta., if Rank(F^n_esta.) = Rank(F^(n−1)_esta.)
F^n_esta. = α · F^(n−1)_esta. + (1 − α) · F^n_esta., if Rank(F^n_esta.) ≠ Rank(F^(n−1)_esta.), α ∈ [0, 1] (8)

where Rank(·) represents the matrix rank and α is an equilibrium factor used to adjust the behavior of the CTPU. In summary, the CTPM decides which CU sizes are skipped before the encoder traverses a complete CU coding tree, while the CTPU maintains prediction accuracy while restricting the increase in computational complexity as much as possible.
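The update rule of Eq 8 can be sketched as follows; the rank comparison and the blend with equilibrium factor α follow the description above, while the model matrices in the example are illustrative.

```python
import numpy as np

def ctpu_update(f_prev, f_curr, alpha=0.5):
    """CU coding tree probability update (Eq 8 sketch): keep the previous
    model when the matrix rank is unchanged, otherwise blend old and new
    with equilibrium factor alpha in [0, 1]."""
    if np.linalg.matrix_rank(f_curr) == np.linalg.matrix_rank(f_prev):
        return f_prev
    return alpha * f_prev + (1 - alpha) * f_curr

# Illustrative models: f_prev has rank 1, f_curr has rank 2, so the rank
# change triggers a blended update.
f_prev = np.array([[0.6, 0.6], [0.4, 0.4]])
f_curr = np.array([[0.2, 0.7], [0.8, 0.3]])
print(ctpu_update(f_prev, f_curr, alpha=0.5))
```

Rank equality serves as a cheap equivalence test: a rescaled but structurally identical distribution keeps the same rank and skips the update, avoiding unnecessary recomputation.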

General Experimental Configuration
The performance of the proposed low complexity CU coding tree mechanism is demonstrated in this section. To validate the effectiveness of the proposed method, gain or loss is measured with respect to HEVC test model version 15.0 (HM15.0). Coding efficiency is measured over sequences with resolutions ranging from 2560×1600 to 416×240. All sequences were tested under the common test conditions of HEVC standardization [20]. Other configurations are set as follows: the CTU varies from 64×64 to 16×16 and the partition depth varies from 4 to 1. For the experiments, the default fast encoding tools in HM15.0 are turned on for the anchor. This means that our proposed method is compared with the original HM15.0 encoder under its best speedup condition.
To fully evaluate the contribution of the proposed low complexity CU coding tree mechanism under different conditions, individual performances were measured according to the following settings: (1) Experiment I, performance of the proposed method for lossy coding; and (2) Experiment II, performance of the proposed method for visually lossless coding and lossless coding. The Bjøntegaard delta peak signal-to-noise ratio (BDPSNR) and the Bjøntegaard delta bit rate (BDBR) [21] were used to evaluate the performance of the proposed method with respect to HEVC. Encoding time saving is calculated according to the following equation:

TS(%) = (Enc.time(Anchor) − Enc.time(Prop.)) / Enc.time(Anchor) × 100 (9)
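Eq 9 is a straightforward relative time reduction; a minimal helper with hypothetical timings looks like:

```python
def time_saving(anchor_seconds, proposed_seconds):
    """TS(%) from Eq 9: relative encoding time reduction vs. the anchor."""
    return (anchor_seconds - proposed_seconds) / anchor_seconds * 100.0

# Hypothetical timings: a 100 s anchor encode reduced to 72 s gives TS = 28%.
print(f"TS = {time_saving(100.0, 72.0):.2f}%")
```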

Coding Performance Assessment
Experiment I is designed to verify the performance of the proposed low complexity CU coding tree mechanism for lossy coding. Experiment I has two parts: the low delay condition and the random access condition. Low delay applies to real-time communication, which has low delay requirements, while random access supports playback, video stream splicing, and so on. Table 5 shows the results of Experiment I for the proposed low complexity CU coding tree mechanism. Compared to the HM15.0 best speedup condition, the proposed method yields a 28.67% average reduction in total encoding time with a 1.39% average BDBR increase or 0.05 dB BDPSNR loss under the low delay condition, and achieves a 25.34% reduction in encoding time with a 1.18% BDBR increase or 0.04 dB BDPSNR loss under the random access condition. In particular, the largest encoding time saving is obtained for Class E under low delay. For the Class E test sequences, the proposed method greatly reduces total encoding time with a negligible BDPSNR loss. This is because Class E has very large still background regions that exhibit little motion: even though the anchor evaluates the complete CU coding tree, the probabilities of CU sizes 16×16 and 8×8 are quite low. Therefore, frames using the new CU coding tree without CU sizes 16×16 and 8×8 achieve coding performance close to the anchor while markedly saving encoding time. The proposed method is thus able to achieve considerable efficiency gains through precise CU coding tree prediction. Experiment II is designed to verify the performance of the proposed method for lossless coding and visually lossless coding. Lossless and visually lossless compression is desired in many professional applications such as medical imaging, surveillance systems, and archiving systems. The specific configuration is shown in Table 6. Table 7 shows the results of Experiment II for the proposed low complexity CU coding tree mechanism.
Compared to the HM15.0 best speedup condition, the proposed method maintains good reconstructed picture quality: PSNR falls by barely 0.02 dB, which meets the visually lossless image requirement. Bit rate increases by a mere 0.82% and 0.46% for visually lossless coding and lossless coding, respectively, which preserves the high compression advantage of HEVC. Most importantly, the proposed method reduces encoding time by up to 39.48% and 43.48% for visually lossless coding and lossless coding, respectively. Moreover, the proposed method achieves a larger encoding complexity reduction for (visually) lossless coding than for lossy coding. The reason is that the optimal CU distribution is either close to 100% or close to zero under low QPs, which makes the new coding tree simpler. Therefore, the proposed method provides a strong approach to low complexity, high quality image and video coding.
The efficiency of optimal CU partitioning is defined as the probability of obtaining the optimal CUs by evaluating a CU coding tree node once. For the anchor, the CU partition time is a constant determined by the number of nodes in the complete CU coding tree. The optimal CU must be one of the CU sizes CU64, CU32, CU16, and CU08, so the efficiency of the anchor is 0.25. For the proposed method, the number of CU coding tree nodes is less than or equal to that of the anchor; therefore, the proposed method obtains the optimal CU by evaluating fewer CU coding tree nodes, and its efficiency usually exceeds 0.25. Occasionally, the proposed method cannot obtain the optimal CU partitioning result, because it omits some low probability CU coding tree nodes. This deviation is reflected in the BDPSNR drop and BDBR increase; in our experiments, the BDPSNR drops by less than 0.05 dB and the BDBR increases by less than 1.39%. The deviation affects coding efficiency only slightly, which we consider the trade-off for the reduction in coding complexity. The efficiency of the proposed method is better than that of the anchor in lossless coding and visually lossless coding. The proposed method uses a CU coding tree with a root node size of 32×32 and a depth of 3 to achieve optimal CU partition results similar to the anchor. As a matter of fact, for most sequences HM rarely selects a CU size of 64×64 under low QPs. Hence, the CU partitioning results of the proposed method with a root node size of 32×32 and a depth of 3 are very similar to the optimal CU partitioning results measured with a root node size of 64×64 and a depth of 4. On the contrary, with high QPs, the proposed method chooses not to evaluate a CU size of 8×8, due to the large number of still and flat background blocks; hence, the proposed method may lose some detail in lossy coding.
For lossy coding, although the CU partitioning results do not contain a CU size of 8×8, the reconstructed picture differs little from the HEVC partitioning result. The reason is that a high QP has a more severe impact on quality degradation than CU partitioning without a CU size of 8×8. Generally speaking, the low complexity CU coding tree mechanism successfully reduces CU encoding complexity by adopting an adaptive CU coding tree structure for each frame, and it avoids complex computation and large memory requests, which makes it easy to apply in pervasive applications.

Conclusions
In this paper, a novel low complexity CU coding tree mechanism is proposed for reducing HEVC encoding time. In order to deal with the high computational complexity caused by the complete traversal of the CU coding tree in HEVC, the CU distribution is explored and a low complexity CU coding tree mechanism is proposed for optimizing the CU coding tree tailoring. The key discovery is that the CU distribution is related to QP and CC. Moreover, the proposed method performs prediction on a GOP basis, which makes full use of the CU distribution redundancy. The experimental results show that, for lossy coding, the proposed low complexity CU coding tree mechanism achieves a 27% average encoding time reduction; for lossless coding and visually lossless coding, it achieves up to a 42% encoding time reduction while maintaining the high quality of the original picture. The proposed mechanism goes beyond the original CU structure by avoiding low probability CU traversals, reducing unnecessary encoding time with almost the same compression performance. Fundamentally, the proposed method effectively improves the performance of HEVC real-time encoding and can be combined with other fast video coding techniques to further accelerate encoding. In addition, the proposed method is well suited to application conditions in which computational resources are limited.