Transfer learning-based channel estimation in orthogonal frequency division multiplexing systems using data-nulling superimposed pilots

Data-nulling superimposed pilot (DNSP) effectively alleviates the superimposed interference of superimposed training (ST)-based channel estimation (CE) in orthogonal frequency division multiplexing (OFDM) systems, but it still faces challenges in estimation accuracy and computational complexity. Building on the promising solutions that deep learning (DL) offers in the physical layer of wireless communication, we fuse DNSP and DL to tackle these challenges in this paper. Nevertheless, when the wireless scenario changes, the model mismatch of DL degrades the CE performance, and the network then has to be retrained. To address this issue, a lightweight transfer learning (TL) network is further proposed for the DL-based DNSP scheme, which yields a TL-based CE for OFDM systems. Specifically, based on the linear receiver, least squares (LS) estimation is first employed to extract the initial features of CE. With the extracted features, we develop a convolutional neural network (CNN) to fuse the solutions of DL-based CE and the CE of DNSP. Finally, a lightweight TL network is constructed to address the model mismatch. In this way, a novel CE network for the DNSP scheme in OFDM systems is structured, which improves the estimation accuracy and alleviates the model mismatch. The experimental results show that in all signal-to-noise ratio (SNR) regions, the proposed method achieves a lower normalized mean squared error (NMSE) than the existing DNSP schemes with minimum mean square error (MMSE)-based CE. For example, at an SNR of 0 decibels (dB), the proposed scheme achieves an NMSE similar to that of the MMSE-based CE scheme at 20 dB, thereby significantly improving the estimation accuracy of CE. In addition, the improvement over the existing schemes persists under parameter variations, which demonstrates the robustness of the proposed scheme.


I. INTRODUCTION
ORTHOGONAL frequency division multiplexing (OFDM) has been widely applied in wireless communication systems because it effectively combats multipath fading [1]. To guarantee reliable communication in OFDM systems, channel estimation (CE) plays a critical role [2] in eliminating the impact of wireless channels, and it has thus inspired many CE methods, e.g., non-pilot-aided CE [3], [4] and pilot-aided CE [5]. Without employing a pilot sequence (PS), non-pilot-aided CE saves valuable bandwidth resources [6]. Yet its high computational complexity hinders its application [7], [8]. By employing a PS, pilot-aided CE achieves high estimation accuracy and is thus favored. However, the PS in pilot-aided CE occupies a considerable share of the valuable bandwidth resources [9], degrading the spectral efficiency of OFDM systems. To avoid this bandwidth occupation, superimposed training (ST)-based CE was proposed. In ST-based CE, the PS is superimposed on the data sequence [10], which saves bandwidth resources and facilitates practical applications [11].
Nevertheless, superimposing the PS on the data sequence introduces superimposed interference, which seriously affects the performance of CE and of the subsequent signal detection at the receiver. To suppress the superimposed interference, several variants of ST-based CE have been developed. Among these variants, the data-nulling superimposed pilot (DNSP) [12] has attracted much attention due to its superiority in interference avoidance. The interference between the data sequence and the PS is avoided by placing the PS in the data-nulled frequency bins [12]. Unfortunately, some information-bearing data at certain frequency bins in DNSP are removed prior to transmission, resulting in symbol misidentification [13]. To tackle symbol misidentification, classic estimation algorithms such as minimum mean square error (MMSE) and maximum likelihood have been introduced into DNSP [14]. Yet they still face many challenges, such as the need for second-order statistics of the channel and noise [15], low estimation accuracy, and high computational complexity [16].
To improve the estimation accuracy and reduce the computational complexity, deep learning (DL)-based CE schemes have been proposed in recent years [17]. In [18], two deep neural networks (DNNs) were designed to refine the channel state information (CSI) accuracy and to improve the performance of data detection, respectively. Furthermore, the convolutional neural network (CNN) is also commonly used in CE [19], where it improves the accuracy of CE with reduced computational complexity. However, DL-based CE for OFDM systems with DNSP has not been investigated, which leaves a gap in avoiding the occupation of bandwidth resources. More importantly, these DL-based CE schemes, e.g., [20], [21], encounter serious model mismatch. When a network model is trained at one base station but then used at another base station (i.e., in a new environment), the network model usually needs to be retrained [22]. To avoid retraining the network model, transfer learning (TL) has been developed in wireless communication systems, e.g., for signal detection [23], CE [24], and CSI feedback [25], [26]. As one of the effective options, TL-based CE is a promising solution for OFDM systems.
Motivated by DNSP, DL, and TL, we fuse DL-based CE and DNSP to improve the estimation accuracy and reduce the computational complexity, and we develop a lightweight TL network to enhance network generalization. To the best of our knowledge, TL-based CE for the DNSP scheme in OFDM systems has not been investigated. The motivation of this paper mainly arises from the following considerations:
1) Although ST-based CE saves valuable spectrum resources, its severe superimposed interference needs to be alleviated, which prompts us to employ DNSP.
2) The challenges of estimation accuracy and computational complexity need to be tackled for the CE in DNSP, which prompts us to fuse DNSP and DL. In this way, the estimation accuracy is improved and the computational complexity is reduced, thereby alleviating the symbol misidentification in DNSP.
3) The developed CE for the DNSP scheme should be robust against environmental changes without being complex. Thus, the dilemmas of network retraining and computational complexity should be solved, which impels us to develop a lightweight TL network. At the cost of a slight increase in computational complexity, the trained network has a good generalization ability.

A. Related Works
In this work, we investigate TL-based CE for OFDM systems with DNSP. The related works include ST-based CE, DL-based CE, and the applications of TL in wireless communication. We review these works in turn as follows.
Relative to the ST scheme in [27], the DNSP scheme improves not only the CE but also the data detection performance [28]. However, DNSP discards part of the information of the transmitted data symbols, causing symbol misidentification. In other words, the CE accuracy of the DNSP scheme is obtained at the cost of degraded data detection performance. Thus, the works in [13], [29]-[36] were proposed to tackle this issue. In [13], a partially data-dependent ST scheme was proposed to reduce symbol misidentification by studying the trade-off between interference cancellation and frequency integrity. A partial-data ST was applied to OFDM in [29], in which an interference control factor is assigned to the training sequence. An enhanced version of [29] was proposed in [30], which shifted the positions of the PS and superimposed them onto the data symbols with the lowest power sum. In [31], the authors proposed an iterative data detection approach based on kernel weighted least squares (LS). The performance of CE and data detection was improved at the cost of increased computational complexity. [32] investigated a data coding scheme to relieve the symbol misidentification. [33] utilized a constellation rotation scheme to preserve the partial symbol information discarded by the superimposed scheme. By considering the signal subspace technique, [34] showed theoretically that the symbol misidentification is related to both the modulation scheme and the pilot pattern. [35] and [36] investigated precoding-based approaches that can effectively retrieve the discarded symbol information. Although symbol misidentification was investigated in [13], [29]-[36], the estimation accuracy and computational complexity of CE, which are limited by the mode of the PS, remain challenges that have not been well tackled.
Recently, DL-based CE has been proposed to improve the estimation accuracy [37]. In [22], an end-to-end approach was applied to the OFDM system using DL, in which the DNN is regarded as a black box for direct signal recovery. [38] constructed a PS designer using two-layer neural networks and a channel estimator using DNNs, which were jointly trained to minimize the mean square error (MSE) of CE. Based on [38], the accuracy of CE was further enhanced by using another DNN in an iterative manner [39]. In addition, [40] exploited the DNN to investigate CE for doubly selective fading channels. [41] introduced a residual learning-based DNN for CE, in which the computational cost is greatly reduced. Along with the DNN, the CNN is also commonly used for CE. In [19], two CNNs were utilized to extract the coarse features and to refine the CSI accuracy, respectively. [42] introduced a federated learning-based framework for downlink CSI prediction, which updates the global model twice by considering the local model weights and the local gradients, respectively. Inspired by the CNN, a graph neural network was constructed in [43] to improve the performance of CE by extracting the correlations of the CSI. Nevertheless, these DL-based CEs still face many challenges, such as poor generalization under environmental changes [44], long training time, complex parameter tuning, and large memory requirements [45].
Relative to the general DL method, TL features many advantages [24], [46], e.g., a huge amount of data is not required, the training time is short, and the network effectively adapts to a new environment without network retraining. In [24], the TL approach was exploited to speed up the adaptation to new environments in low-resolution multiple-input multiple-output systems. By using direct transfer, the TL-based CE in [24] was designed to adapt to the migration from one environment to another, but it still encounters high computational complexity.
Although these DL-based methods [47] improve the CE accuracy and reduce the computational complexity, DL-based CE for the DNSP scheme has not been investigated, and it faces the challenge of network retraining. The few existing TL-based CEs (e.g., [24]) show good prospects for improving CE generalization. Yet their network architectures need to be lightweight to reduce the computational complexity and thus facilitate practical deployment. To the best of our knowledge, TL-based CE for the DNSP scheme has not been investigated either. To remedy the deficiencies of the related works (i.e., on DNSP, DL, and TL) while retaining their advantages, we investigate TL-based CE for the DNSP scheme in OFDM systems.

B. Contributions
The main contributions of this paper are summarized as follows.
1) We develop a framework for fusing DL-based CE and DNSP. To the best of our knowledge, DL-based CE for the DNSP scheme has not been investigated. We employ DNSP to improve the spectral efficiency and integrate a DL network to enhance the estimation accuracy. Our approach to improving the estimation accuracy is to capture both the linear and nonlinear solutions: the initial feature for CE is first extracted by the LS-based linear estimator, which is followed by a nonlinear DL network. With the developed framework, both the spectral efficiency and the estimation accuracy are improved.
2) We introduce a lightweight TL network for DL-based CE with DNSP. According to existing investigations [24], the model mismatch (i.e., the trained network cannot adapt to changes of the transmission environment) has not been well addressed by DL-based CE. With the DNSP scheme, this situation worsens further. We introduce the lightweight TL network to tackle this issue. Although the TL network is lightweight, it substantially alleviates the model mismatch for CE. In particular, the computational complexity and the online training time increase only slightly, which facilitates practical applications.
3) We develop a novel ReCNN to improve the estimation accuracy of the CE with the DNSP scheme. The developed ReCNN employs a CNN to capture the solution of DL-based CE and utilizes the lightweight TL network to enhance model generalization. Thus, the CE accuracy is improved by the DL-based mode, and the model mismatch is addressed by the TL approach without a significant increase in computational complexity. This network architecture provides a good paradigm for adapting CE to the transmission environment.
The remainder of this paper is structured as follows. In Section II, we describe the system model. The TL-based CE using the DNSP method is presented in Section III, followed by the experimental results and analysis in Section IV. Finally, Section V concludes our work.
Notations: Boldface upper-case and lower-case letters denote matrices and vectors, respectively.
The modulated signal $\mathbf{s}$ is first precoded by a unitary matrix $\mathbf{W} \in \mathbb{R}^{N \times N}$, e.g., a Walsh-Hadamard matrix, to maintain the orthogonality between the PS and the data sequence. Then, the modulated signal is nulled at equidistant positions for inserting pilots [12]. For convenience, a diagonal matrix $\mathbf{J} \in \mathbb{R}^{N \times N}$ is utilized to screen the nulled positions; its diagonal entries are given by (1), where $Q$ is the spacing between pilots with $Q = N/P$. Then, the transmitted signal $\mathbf{X}$ can be expressed as in (2), where $\rho \in [0, 1]$ stands for the power proportional coefficient, $E$ represents the transmitting power, and $\mathbf{c} \in \mathbb{C}^{N \times 1}$ denotes the training sequence. Then, the transmitted signal $\mathbf{X}$ is transformed to the time domain by an inverse discrete Fourier transform, and a sufficiently long cyclic prefix (CP) is added. At the receiver, after CP removal and the discrete Fourier transform, the received signal $\mathbf{Y}$ is written as
$$\mathbf{Y} = \mathbf{H}\mathbf{X} + \mathbf{n}, \tag{3}$$
where $\mathbf{H} = \mathrm{diag}(h_0, h_1, \ldots, h_{N-1})$ with diagonal entries being the frequency response of the quasi-static frequency selective fading channel, and $\mathbf{n} \in \mathbb{C}^{N \times 1}$ denotes the circularly symmetric complex Gaussian (CSCG) noise with zero mean and variance $\sigma_n^2$. With the received signal $\mathbf{Y}$, the low-complexity LS estimation is first employed to extract the initial features of CE. With the extracted features, a CNN is developed to enhance the CE, which forms the framework for fusing DL-based CE and DNSP. Followed by a lightweight TL network, the developed CNN further forms the ReCNN to enhance model generalization. According to the CE produced by the ReCNN, we perform equalization and detection to recover the transmitted signal. The details of the TL-based CE are elaborated in the next section.
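As a minimal illustration of the transmitter described above, the following Python (NumPy) sketch builds one DNSP symbol. It assumes that the transmitted signal has the form $\mathbf{X} = \sqrt{(1-\rho)E}\,\mathbf{J}\mathbf{W}\mathbf{s} + \sqrt{\rho E}\,\mathbf{c}$ with the pilots confined to the nulled bins $0, Q, 2Q, \ldots$. This form, the all-ones pilot values, and the function names `walsh_hadamard` and `dnsp_transmit` are assumptions made for illustration and are not taken from [12].

```python
import numpy as np

def walsh_hadamard(n):
    """Orthonormal Hadamard matrix of size n (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    w = np.array([[1.0]])
    while w.shape[0] < n:
        w = np.kron(w, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return w / np.sqrt(n)

def dnsp_transmit(s, c, num_pilots, rho, power):
    """Assumed DNSP mapping: X = sqrt((1-rho)E) J W s + sqrt(rho E) c."""
    n = s.shape[0]
    q = n // num_pilots                      # pilot spacing Q = N / P
    pilot_idx = np.arange(num_pilots) * q    # assumed nulled bins 0, Q, 2Q, ...
    j_diag = np.ones(n)
    j_diag[pilot_idx] = 0.0                  # J removes the data on the pilot bins
    c_masked = np.zeros(n, dtype=complex)
    c_masked[pilot_idx] = c[pilot_idx]       # pilots occupy only the nulled bins
    precoded = walsh_hadamard(n) @ s         # precoding spreads every symbol over all bins
    x = np.sqrt((1 - rho) * power) * j_diag * precoded + np.sqrt(rho * power) * c_masked
    return x, pilot_idx

rng = np.random.default_rng(0)
N, P, RHO, E = 256, 8, 0.2, 1.0
bits = rng.integers(0, 2, size=(N, 2))
s = ((1 - 2 * bits[:, 0]) + 1j * (1 - 2 * bits[:, 1])) / np.sqrt(2)  # unit-power QPSK
c = np.ones(N, dtype=complex)                # hypothetical pilot values
x, pilot_idx = dnsp_transmit(s, c, P, RHO, E)
```

Under these assumptions, the pilot bins carry only the scaled pilot and every other bin carries only precoded data, so the pilot observations at the receiver are free of data interference.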

III. TRANSFER LEARNING-BASED CHANNEL ESTIMATION
In the DNSP scheme, some information-bearing data are removed, which causes symbol misidentification at the receiver. Meanwhile, the performance of DL-based CE is seriously degraded by variations of the communication environment, so that the network model needs to be retrained. To overcome these challenges, TL is introduced into the CE with DNSP. In the following subsections, we first introduce TL and then elaborate on the TL-based CE.

A. Transfer Learning
From [25], the same type of environment is divided into several different regions. Let $\mathcal{X}$ and $\mathcal{Y}$ denote the space of the channels in the different regions, respectively. Then, the definitions of the "domain" and the "task" are given in the following four definitions [48]:
Definition 1: The "domain" $\mathcal{D}$ is composed of the feature space $\mathcal{X}$ and the marginal probability distribution $P(\mathbf{h}|\mathbf{h}_{LS})$, i.e., $\mathcal{D} = \{\mathcal{X}, P(\mathbf{h}|\mathbf{h}_{LS})\}$, where $\mathbf{h}$ is the real CSI and $\mathbf{h}_{LS}$ is the CSI estimated by the LS algorithm. Meanwhile, the "task" $\mathcal{T}$ is defined as the prediction of the target channels from the source channels. Given the specific domain $\mathcal{D}$, the "task" $\mathcal{T}$ is composed of the label space $\mathcal{Y}$ and the prediction function $\mathcal{F}$, i.e., $\mathcal{T} = \{\mathcal{Y}, \mathcal{F}\}$. The prediction function $\mathcal{F}$ can be learned from the training data of the source region and then be used to predict the target channels in the target region.
Classical TL consists of two aspects, namely, the source domain transfer and the target domain adaptation. Based on [48], TL can be defined as follows:
Definition 2: Given the source task $\mathcal{T}_S$, the source domain $\mathcal{D}_S$, the target task $\mathcal{T}_T$, and the target domain $\mathcal{D}_T$, the aim of TL is to improve the performance of the target task $\mathcal{T}_T$ by using the knowledge from $\mathcal{T}_S$ and $\mathcal{D}_S$, where $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$.
Here we extend the single-source domain transfer to the multi-source domain transfer. Then, a more generalized definition of TL is provided by Definition 3 and Definition 4. Based on Definition 3 and Definition 4, the target region channel prediction for OFDM systems can be formulated as a typical TL problem, where the $k$-th learning task is to predict the target channel from the source channel in the $k$-th region.

B. TL-Based CE
To enhance the estimation accuracy and model generalization of the CE for the DNSP scheme, a CNN is first developed and is followed by a lightweight TL network, which together form the ReCNN. Although the ReCNN includes both the DL and TL networks, the ReCNN-based CE is referred to as TL-based CE in this paper to highlight its transfer characteristics. By using the TL-based CE, not only is the requirement of second-order statistics of the channel and noise avoided, but the adaptability of CE to varying communication environments is also improved. In this subsection, the model structure of the ReCNN is briefly described, and then the model training is illustrated in detail.
1) Model Structure: In this paper, we fuse the LS estimator and DL-based CE for the DNSP scheme to capture the non-neural-network (non-NN) and neural-network (NN) solutions, respectively. To further address the model mismatch of DL-based CE, a lightweight TL network, which is inspired by the DnCNN in [19] and named ReCNN, is proposed in this paper. The modified structure is shown in Fig. 2. Therein, the ReCNN consists of fourteen convolutional layers and two fully connected layers. The convolutional layers extract features from the input. Then, the fully connected layers learn non-linear combinations of these extracted features to further improve the performance of the task. In addition, the residual block between the input layer and the last convolutional layer utilizes the CNN with a subtraction structure to learn the residual noise from the noisy channel matrix for denoising. The corresponding hyperparameters are listed in Table I, where we first set these hyperparameters according to the ReCNN structure and then fine-tune them empirically to find appropriate values for the network.
2) Model Training:
• LS channel estimation: LS-based CE is utilized to extract an initial feature of CE in the frequency domain [49], which also provides the data sets (including the training set, validation set, and testing set) for the ReCNN.
According to the LS algorithm [12], the frequency-domain channel estimator $\mathbf{H}_P \in \mathbb{C}^{P \times 1}$ is given by (4), where $Y(i)$ and $c(i)$, $i = 0, \ldots, N-1$, are the frequency-domain values of the received signal $\mathbf{Y}$ and the PS $\mathbf{c}$, respectively. Then, we transform $\mathbf{H}_P$ into the time domain and obtain the time-domain CE $\hat{\mathbf{h}} \in \mathbb{C}^{P \times 1}$ as in (5). By appending zeros to the end of $\hat{\mathbf{h}}$, a vector $\tilde{\mathbf{h}}$ with length $N$ is formed, i.e.,
$$\tilde{\mathbf{h}} = \left[\hat{\mathbf{h}}^T, 0, \ldots, 0\right]^T. \tag{6}$$
Then, the LS estimation $\mathbf{H}_{LS}$ is obtained by transforming $\tilde{\mathbf{h}}$ into the frequency domain as in (7).
• Data collection: For the training of the ReCNN, we continuously collect $M$ time slots as one training sample, and the training set is defined as $\{\mathcal{D}\} = \{\mathbf{H}_{LS}, \mathbf{H}_{Label}\}$. Considering the quasi-static frequency selective fading channel [12], the channel $\mathbf{h}$ is generated according to the widely adopted COST2100 channel model [50] without loss of generality, where different outdoor semi-urban scenarios at the 300 MHz band are considered. In other words, the number of clusters in each environment is randomly distributed rather than fixed. Then, $\mathbf{h}$ is transformed into the frequency domain to form the label $\mathbf{H}_{Label}$, and the complex-valued $\mathbf{H}_{Label}$ is mapped to its real-valued counterpart.
After the received signal $\mathbf{Y}$ is obtained from (3), $\mathbf{H}_{LS}$ is generated according to (4)-(7). Finally, the complex-valued set $\{\mathbf{H}_{LS}\}$ is reshaped into a real-valued set.
Considering the TL-based CE in this paper, we use the training set $\{\mathcal{D}_S(k)\}_{k=1}^{K_s}$ to represent the source domain dataset and $\{\mathcal{D}_T\}$ to represent the target domain dataset. It is worth noting that the two datasets are generated in the source regions and the target region of the COST2100 channel model environment, respectively. In addition, to validate the trained network parameters during the training phase, a validation set is generated in the same way as the training set, so that a set of optimized network parameters can be captured.
• Pre-training: We denote the whole set of network parameters as $\Theta = \{\Theta_{pre}, \Theta_{fin}\}$, where $\Theta_{pre}$ and $\Theta_{fin}$ are the sets of parameter values for the pre-training and fine-tuning phases, respectively. As in Algorithm 1, the source domain dataset $\{\mathcal{D}_S\}$ generated by $K_s$ source tasks is utilized to train the ReCNN. To extract the time correlation, the data of $M$ time slots are collected continuously. That is, the input of the ReCNN is the CSI matrix $\mathbf{H}_{LS}$ and the output is the estimated channel matrix, which is denoted as
$$\hat{\mathbf{H}} = F(\mathbf{H}_{LS}; \Theta_{pre}), \tag{8}$$
where $F$ is the pre-training function and $\Theta_{pre}$ is the pre-training parameter set to be updated by training. The total loss function of the network is the MSE between the estimated and the actual channel responses, calculated as
$$L(\Theta_{pre}) = \frac{1}{T} \sum_{t=1}^{T} \left\| \hat{\mathbf{H}}^{(t)} - \mathbf{H}_{Label}^{(t)} \right\|_2^2, \tag{9}$$
where $T$ denotes the number of training samples and the superscript $(t)$ indexes the samples. After $\Theta_{pre}$ is obtained via $G_{pre}$ steps of Adam updating, we use the target domain dataset $\{\mathcal{D}_T\}$ for testing. Meanwhile, the corresponding normalized mean square error (NMSE) is saved accordingly.
• Fine-tuning: […] (9)) to operate the fine-tuning phase. After the fine-tuning is finished by $G_{fin}$ update steps, the ReCNN parameters $\Theta_{fin}$ are optimized. Finally, the testing set $\{\mathcal{D}_{TTe}\}$ is used to predict the CSI of the target environment.
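The LS feature extraction that feeds the ReCNN (LS at the pilot bins, transformation to the time domain, zero padding, and transformation back to all $N$ subcarriers) can be sketched in Python (NumPy) as follows. This is an illustrative sketch, not the authors' implementation: it assumes that the pilots occupy the bins $0, Q, 2Q, \ldots$, that `c` already includes the pilot amplitude, and that the channel has at most $P$ taps. The function names `ls_dnsp_estimate` and `to_real`, as well as the stacking of real and imaginary parts, are hypothetical choices.

```python
import numpy as np

def ls_dnsp_estimate(y, c, pilot_idx, n_sub):
    """LS at the pilot bins, then IDFT -> zero padding -> DFT to all subcarriers."""
    h_p = y[pilot_idx] / c[pilot_idx]     # LS estimate on the P pilot bins
    h_time = np.fft.ifft(h_p)             # P-point IDFT: time-domain taps
    padding = np.zeros(n_sub - h_time.size, dtype=complex)
    h_pad = np.concatenate([h_time, padding])   # append zeros up to length N
    return np.fft.fft(h_pad)              # N-point DFT: estimate on every bin

def to_real(h):
    """Stack real and imaginary parts (one possible complex-to-real mapping)."""
    return np.stack([h.real, h.imag], axis=-1)

# Noise-free check with a random 8-tap channel
rng = np.random.default_rng(1)
N, P, L = 256, 8, 8
h = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2 * L)
H = np.fft.fft(h, N)                      # true frequency response
pilot_idx = np.arange(P) * (N // P)
c = np.ones(N, dtype=complex)             # hypothetical pilot values
y = H * c                                 # only the pilot bins are used below
H_ls = ls_dnsp_estimate(y, c, pilot_idx, N)
features = to_real(H_ls)                  # real-valued network input, shape (N, 2)
```

Without noise and with no more than $P$ channel taps, this interpolation reproduces the true frequency response, so any residual error in `H_ls` comes from noise at the pilot bins; that residual is what the network is trained to remove.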

IV. EXPERIMENTAL ANALYSIS
In this section, numerical results of the proposed TL-based CE using DNSP are given. First, the basic parameters and definitions involved in the simulations are introduced. Then, the NMSE of CE and the bit error rate (BER) of detection for the proposed scheme are shown to verify the effectiveness of the proposed TL-based CE. Finally, we discuss the robustness of the proposed scheme against the influence of different parameters. The source code is available at https://github.com/Leiunnn/TransferLearningBasedCEbyDNSP.git.

A. Parameter Setting
In the experiments, the following basic parameters are applied unless otherwise specified: $N = 256$ and $P = 8$ [12]. The channel $\mathbf{h}$ is generated by the COST2100 channel model [50] at the 300 MHz band, the number of multipaths is set to $L = 8$, and the data of $M = 16$ time slots are collected continuously [19]. The transmitted data symbol $\mathbf{s}$ is modulated by QPSK. The data sets $\{\mathcal{D}_S\}$ and $\{\mathcal{D}_T\}$ have 8,000 and 500 samples, respectively, and the batch size is 20 samples for both. In the pre-training phase, $\{\mathcal{D}_S\}$ is divided into a training set and a validation set with 6,000 and 2,000 samples, respectively. Of the 500 samples in $\{\mathcal{D}_T\}$, 300 and 200 samples are allocated to $\{\mathcal{D}_{TTr}\}$ and $\{\mathcal{D}_{TTe}\}$, respectively. We use the Adam optimizer as the training optimization algorithm [51] with parameters $\beta_1 = 0.99$ and $\beta_2 = 0.999$ [52]. The learning rates of both phases are set to 0.0001, and the signal-to-noise ratio (SNR) in decibels (dB) is defined as
$$\mathrm{SNR} = 10 \log_{10}\left(\frac{E}{\sigma_n^2}\right), \tag{10}$$
where $E$ is the transmitted power of $\mathbf{X}$, which is equal to the sum of the data-symbol power $E_s$ and the training-sequence power $E_c$. In these simulations, $E_s = (1 - \rho)E$ and $E_c = \rho E$, where $\rho = 0.2$. For network training, a mixed SNR is adopted, i.e., each training sample is generated under an SNR drawn at random from 0 dB to 35 dB in steps of 5 dB.
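The mixed-SNR sample generation can be illustrated with a short Python (NumPy) sketch. It assumes the SNR definition $\mathrm{SNR} = 10\log_{10}(E/\sigma_n^2)$ and a uniform draw from the grid 0, 5, ..., 35 dB; the uniform draw and the function names `noise_variance` and `add_mixed_snr_noise` are assumptions for illustration only.

```python
import numpy as np

SNR_GRID_DB = np.arange(0, 40, 5)   # 0, 5, ..., 35 dB

def noise_variance(snr_db, power=1.0):
    """sigma_n^2 under the assumed definition SNR = 10*log10(E / sigma_n^2)."""
    return power / (10.0 ** (snr_db / 10.0))

def add_mixed_snr_noise(signal, rng, power=1.0):
    """Draw one SNR from the grid and add CSCG noise of the matching variance."""
    snr_db = rng.choice(SNR_GRID_DB)
    sigma2 = noise_variance(snr_db, power)
    # Real and imaginary parts each carry half of the noise variance
    noise = np.sqrt(sigma2 / 2) * (rng.standard_normal(signal.shape)
                                   + 1j * rng.standard_normal(signal.shape))
    return signal + noise, snr_db

# Power split between data and pilots for rho = 0.2 and E = 1
RHO, E = 0.2, 1.0
E_s, E_c = (1 - RHO) * E, RHO * E

rng = np.random.default_rng(0)
noisy, snr_db = add_mixed_snr_noise(np.zeros(100_000, dtype=complex), rng, E)
```

Drawing the SNR per sample exposes a single network to the whole SNR range, so no separate model has to be trained for each SNR point.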
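The NMSE is utilized to evaluate the CE performance and is defined as [53]
$$\mathrm{NMSE} = \frac{\left\| \mathbf{H} - \hat{\mathbf{H}} \right\|_2^2}{\left\| \mathbf{H} \right\|_2^2}, \tag{11}$$
where $\mathbf{H}$ and $\hat{\mathbf{H}}$ denote the actual and the estimated channel frequency responses, respectively. For convenience of expression, the following shorthand labels are used in the simulations.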
• "Proposed+ZF SD", "Proposed+MMSE SD", "No Transfer+ZF SD", and "No Transfer+MMSE SD" stand for the "proposed transfer learning channel estimation followed by ZF equalization", "proposed transfer learning channel estimation followed by MMSE equalization", "without transfer learning channel estimation followed by ZF equalization", and "without transfer learning channel estimation followed by MMSE equalization", respectively.
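For reference, the NMSE metric used throughout this section can be computed as in the following Python (NumPy) sketch. It assumes the common definition of the NMSE as the squared estimation error normalized by the energy of the true channel; the function names `nmse` and `nmse_db` are hypothetical.

```python
import numpy as np

def nmse(h_est, h_true):
    """Squared estimation error normalized by the energy of the true channel."""
    num = np.sum(np.abs(h_est - h_true) ** 2)
    den = np.sum(np.abs(h_true) ** 2)
    return num / den

def nmse_db(h_est, h_true):
    """The same metric on a decibel scale."""
    return 10.0 * np.log10(nmse(h_est, h_true))

rng = np.random.default_rng(3)
h_true = rng.standard_normal(256) + 1j * rng.standard_normal(256)
h_scaled = 1.1 * h_true               # a 10 % amplitude error on every bin
value = nmse(h_scaled, h_true)        # equals 0.1**2 up to rounding
```

Because of the normalization, a perfect estimate gives 0 and an all-zero estimate gives 1, which makes NMSE values comparable across channels of different gain.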

B. CE and Symbol Detection (SD) Performance
To validate the effectiveness of the proposed TL-based CE using DNSP, the performances of CE and SD under different SNRs are illustrated in Fig. 3 and Fig. 4, respectively, and the NMSE values are listed in Table II for the convenience of comparison. From Fig. 3, the NMSE of [29] is higher than that of [12]. The reason is that the partially data-dependent ST scheme employed in [29] superimposes the data sequence on the transmitted PS and thus introduces superimposed interference into its CE. It can be observed that the NMSE of "No Transfer" is smaller than those of "LS CE [12]" and "LS CE [29]". Moreover, from Table II, the NMSE of "No Transfer" is smaller than that of "MMSE CE [12]" in the relatively low SNR region (e.g., SNR ≤ 10 dB). This indicates that the ReCNN can still extract certain CSI features when the environment changes. From Fig. 3 and Table II, "Proposed" achieves the lowest NMSE in all SNR regions, even compared with "MMSE CE [12]". For example, at SNR = 20 dB, the NMSE of "Proposed" is $4.9 \times 10^{-3}$, while the NMSE of "MMSE CE [12]" is $1 \times 10^{-1}$. This shows that "Proposed" obtains a higher CE accuracy than "MMSE CE [12]" and thus works well in the changed environment. Hence, the proposed scheme is effective in reducing the NMSE of CE.
Since the PS $\mathbf{c}$ is superimposed on the modulated data symbol $\mathbf{s}$, it needs to be verified whether the superimposed interference (from the ST) degrades the detection performance of the data symbols. In this paper, the BER is used to measure the detection performance and is plotted in Fig. 4 for [12], [29], and the proposed scheme. With the same CE and equalization methods, the BER of [12] is lower than that of [29] due to the influence of CE. Meanwhile, with the same equalization method, both "Proposed+ZF SD" and "Proposed+MMSE SD" achieve the smallest BER for all given SNRs. At SNR = 20 dB, the BER of "Proposed+MMSE SD" is less than $1.8 \times 10^{-3}$, while the BER of "MMSE CE+MMSE SD [12]" is about $1.3 \times 10^{-2}$. This verifies that the proposed CE scheme improves the BER performance as well.
On the whole, compared with "LS CE [12]", "LS CE [29]", "No Transfer", "MMSE CE [29]", and "MMSE CE [12]", both the CE and SD performances in the new environments are improved by "Proposed". In particular, compared with "MMSE CE [12]", "Proposed" achieves a lower NMSE without requiring the second-order statistics of the channel and noise. Meanwhile, based on "Proposed", the ZF/MMSE equalization obtains the smallest BER due to the improved CE.
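How an estimated channel enters the symbol detection can be illustrated with one-tap ZF and MMSE equalizers, sketched below in Python (NumPy). This is a simplified sketch under stated assumptions: the equalizers act per subcarrier on a diagonal channel, QPSK bits map to symbols as $(1-2b_0) + j(1-2b_1)$, and the removal of the precoding $\mathbf{W}$ and of the pilots in the DNSP receiver is omitted. All function names are hypothetical.

```python
import numpy as np

def zf_equalize(y, h_est):
    """Zero-forcing: divide each subcarrier by its estimated channel gain."""
    return y / h_est

def mmse_equalize(y, h_est, noise_var, signal_power=1.0):
    """One-tap MMSE: regularizes the inversion where the channel gain is small."""
    return np.conj(h_est) * y / (np.abs(h_est) ** 2 + noise_var / signal_power)

def qpsk_bits(symbols):
    """Hard decisions for the assumed mapping (1 - 2*b0) + 1j*(1 - 2*b1)."""
    return np.stack([symbols.real < 0, symbols.imag < 0], axis=-1).astype(int)

def ber(bits_hat, bits):
    """Fraction of bit positions that differ."""
    return np.mean(bits_hat != bits)

rng = np.random.default_rng(2)
bits = rng.integers(0, 2, size=(64, 2))
x = ((1 - 2 * bits[:, 0]) + 1j * (1 - 2 * bits[:, 1])) / np.sqrt(2)
h = rng.standard_normal(64) + 1j * rng.standard_normal(64)
y = h * x                                   # noise-free reception
ber_zf = ber(qpsk_bits(zf_equalize(y, h)), bits)
ber_mmse = ber(qpsk_bits(mmse_equalize(y, h, 0.1)), bits)
```

Both equalizers rely directly on the channel estimate, which is why a more accurate CE translates into a lower BER for either of them.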

C. Analysis of Parameter Impact
In this subsection, the robustness of the proposed scheme against parameter variation is analysed. The impact of the pilot number $P$ is discussed first, followed by that of the superposition factor $\rho$. It is worth noting that, except for the parameters under study (i.e., $P$ and $\rho$), the other basic parameters remain the same as those used in the "CE and Symbol Detection (SD) Performance" subsection.
1) Impact of $P$: The NMSE of CE and the BER of SD are usually affected by the number of pilots (i.e., $P$). To reveal the robustness of the proposed CE scheme against the impact of $P$, the NMSE of CE and the BER of SD are given in Fig. 5 and Fig. 6, respectively. Since "LS CE [12]" and "MMSE CE [12]" achieve smaller NMSEs than those of [29] with the same CE method, we only employ [12] as a comparison when discussing the parameter $P$. In Fig. 5, the NMSEs of "LS CE [12]" and "MMSE CE [12]" decline as the pilot number $P$ increases. That is, a more accurate CSI can be achieved by the conventional CE schemes with more pilots. In addition, for each given value of $P$, "Proposed" achieves the smallest NMSE for all given SNRs. From Fig. 5, when SNR = 20 dB and $P = 4$, the NMSE of "MMSE CE [12]" is higher than $2 \times 10^{-1}$, while the NMSE of "Proposed" is lower than $3 \times 10^{-2}$. This shows that the proposed scheme improves the NMSE compared with the existing methods under variations of $P$. Meanwhile, it can be observed that the proposed scheme is robust against the varying $P$.
Fig. 6 gives the BER performance with the different equalization methods against the impact of $P$. From Fig. 6, the BER does not vary regularly with $P$. The reason is that the performance of SD is affected by both $P$ and the performance of CE. Thus, to achieve a lower BER, the pilot number has to be chosen as a trade-off in the proposed scheme. Nevertheless, for each given $P$, "Proposed+ZF SD" and "Proposed+MMSE SD" obtain smaller BERs than those of [12] with the same SD methods. For example, when SNR = 20 dB and $P = 8$, the BER of "Proposed+MMSE SD" is about $1.6 \times 10^{-3}$, while the BER of "MMSE CE+MMSE SD [12]" is larger than $1 \times 10^{-2}$. This validates that the proposed scheme improves the SD performance and is robust against the varying $P$.
On the whole, from Fig. 5 and Fig. 6, the NMSEs and BERs of the proposed scheme show superiority over those of [12]. With the varying $P$, the proposed scheme still improves the performance of CE and SD and is thus robust.
2) Impact of $\rho$: For the same reason as in the simulations on the parameter $P$, we only compare with the performance of [12]. From Fig. 7, the NMSEs of "Proposed", "LS CE [12]", and "MMSE CE [12]" decrease as the power proportional coefficient $\rho$ increases. Although the decline of the NMSE is not obvious when comparing $\rho = 0.3$ with $\rho = 0.2$, a decreasing tendency is still observed. The reason for this phenomenon is that the performance of CE improves with the increased pilot power: the larger the pilot power, the higher the CE accuracy. Besides, for $\rho = 0.1$, $\rho = 0.2$, and $\rho = 0.3$, "Proposed" obtains a smaller NMSE than "LS CE [12]" and "MMSE CE [12]". For example, when SNR = 20 dB and $\rho = 0.1$, the NMSE of "MMSE CE [12]" is higher than $1 \times 10^{-1}$, while the NMSE of "Proposed" is lower than $1 \times 10^{-2}$. This shows that "Proposed" reduces the NMSE of CE under varying $\rho$ and is thus robust against the impact of $\rho$.
In Fig. 8, the BERs of SD are obtained based on three CE methods, i.e., "Proposed", "LS CE [12]", and "MMSE CE [12]". From Fig. 8, the BERs of the different methods do not change greatly as $\rho$ increases. Thus, although the power factor $\rho$ influences the performance of CE, it has little impact on SD because the pilots are superimposed on the data-nulled bins in the DNSP scheme. Even so, a significant improvement of the BER performance is still obtained. For each given $\rho$, "Proposed+ZF SD" and "Proposed+MMSE SD" obtain smaller BERs than the SD methods in [12]. From Fig. 8, when SNR = 20 dB and $\rho = 0.2$, the BER of "Proposed+MMSE SD" is lower than $2 \times 10^{-3}$, while the BER of "MMSE CE+MMSE SD [12]" is about $1.2 \times 10^{-2}$. This indicates that the proposed scheme improves the SD performance even under varying $\rho$.
To sum up, the NMSEs and BERs of the proposed scheme show superiority over those of [12] in Fig. 7 and Fig. 8, respectively. Against the impact of $\rho$, the proposed scheme effectively reduces the NMSE of CE and the BER of SD, which demonstrates its robustness.

V. CONCLUSION
In this paper, the TL-based CE in OFDM systems using the DNSP scheme has been investigated, and a novel network, ReCNN, has been formed. In the proposed scheme, the employed CNN improves the accuracy of DL-based CE by fusing the linear and nonlinear solutions. With the lightweight TL network, the generalization of CE is enhanced. As a result, the DL-based model improves the CE accuracy, and the TL approach addresses the model mismatch without a significant increase in computational complexity. Compared with the existing DNSP schemes with MMSE-based CE, the proposed scheme obtains lower NMSE and BER without requiring second-order statistics of the channel and noise. The generalization of CE under environmental changes is also validated. The proposed scheme presents good estimation accuracy and model generalization, helping the existing research on DL-based CE move towards practical application. In future work, we will investigate online learning-based CE for the DNSP scheme in OFDM systems.
$(\cdot)^T$ and $(\cdot)^H$ denote the transpose and conjugate transpose, respectively. $\lfloor \cdot \rfloor$ represents the floor operation. $\mathrm{diag}(\cdot)$ is the diagonalization operation of a matrix. $I$ represents an $N \times N$ identity matrix. $\| \cdot \|_2$ is the Euclidean norm.

II. SYSTEM MODEL
In this paper, we consider an OFDM system with N subcarriers and P pilots by using DNSP. As shown in Fig 1, the modulated signal $s = [s_0, s_1, \ldots, s_{N-1}]$

Definition 3: Let $B_T$ and $B_k$ represent the sets in the target and the k-th source given environments, respectively, where $k = 1, \ldots, K_s$, with $K_s$ being the number of source tasks. Given the source tasks $\{T_S(k)\}_{k=1}^{K_s}$, the source domains $\{D_S(k)\}_{k=1}^{K_s}$, the target task $T_T$, and the target domain $D_T$, the aim of TL is to improve the performance of the target task $T_T$ by using the knowledge from $\{T_S(k)\}_{k=1}^{K_s}$ and $\{D_S(k)\}_{k=1}^{K_s}$, where $D_T \neq D_S(k)$ or $T_T \neq T_S(k)$.

In Definition 3, the condition $D_T \neq D_S(k)$ means that either the corresponding feature spaces differ, i.e., $X_T \neq X_S(k)$, or the corresponding marginal probability distributions differ, i.e., $P_T(h \mid h_{LS})|_{h \in B_T} \neq P_{S(k)}(h \mid h_{LS})|_{h \in B_k}$. The condition $T_T \neq T_S(k)$ means that either the label spaces differ, i.e., $Y_T \neq Y_S(k)$, or the corresponding conditional probability distributions differ, i.e., $P_T(h \mid h_{LS})|_{h \in B_T} \neq P_{S(k)}(h \mid h_{LS})|_{h \in B_k}$. Since the conditional probability distributions for different prediction tasks are different, the condition $T_T \neq T_S(k)$ is satisfied. Therefore, the target region channel prediction for the OFDM systems can be formulated as a TL problem. In this paper, TL transfers the knowledge using the ReCNN, which is defined as follows:

Definition 4: Given a TL task described by $\{T_S(k)\}_{k=1}^{K_s}$, $\{D_S(k)\}_{k=1}^{K_s}$, $T_T$, and $D_T$, it is a TL task when the prediction function $F_T$ of $T_T$ is a non-linear function that is approximated by the ReCNN network.

Model training: Before introducing the ReCNN network training, the LS-based CE is first described, followed by the data collection. Finally, the proposed ReCNN-based TL training, i.e., pre-training and fine-tuning, is explained. The model training details are summarized in Fig 2.
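As a rough illustration of the LS-based CE step named above, the sketch below computes a per-subcarrier LS estimate at the pilot positions. It assumes a one-tap frequency-domain model in which the received pilot equals the channel coefficient times the known pilot symbol; the function name `ls_estimate`, the pilot positions, and the toy channel are hypothetical, and noise is omitted to keep the example short.

```python
def ls_estimate(y, x_pilot, pilot_idx):
    """LS estimate h_LS[k] = y[k] / x_p[k] at each pilot subcarrier k."""
    return [y[k] / x_pilot[i] for i, k in enumerate(pilot_idx)]


# Toy setup: N = 8 subcarriers and P = 2 pilots (positions are hypothetical).
N = 8
pilot_idx = [0, 4]
x_pilot = [1 + 0j, -1 + 0j]
h_true = [complex(0.5 + 0.1 * k, -0.2 * k) for k in range(N)]

# Noise-free received pilots, assuming the data is nulled at pilot positions.
y = [0j] * N
for i, k in enumerate(pilot_idx):
    y[k] = h_true[k] * x_pilot[i]

h_ls = ls_estimate(y, x_pilot, pilot_idx)
print(h_ls)
```

In the noise-free case the LS estimate recovers the channel at the pilot subcarriers exactly; with noise it becomes a coarse initial feature that the network then refines.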

Fig 3 and Fig 4, respectively. The NMSE curves of different CE methods are compared in Fig 3, and partial numerical results are presented in Table
Fig 4. Two conventional equalization methods, i.e., ZF equalization and MMSE equalization, are utilized to equalize the wireless channel. In Fig 4, we compare the BERs among those of

Fig. 6. BER of SD against the impact of P, where P = 4, P = 8, and P = 16 are considered.

Fig 5 and Fig 6, the proposed scheme reaches lower NMSEs and BERs than those of

• Fine-tuning: After the pre-training is performed according to Algorithm 1, we need to build the training set of the target domain. First, we divide the dataset $\{D_T\}$ into $\{D_{TTr}\}$ and $\{D_{TTe}\}$. Then, the pre-trained network parameters $\Theta_{pre}$ are loaded into the ReCNN. It is worth noting that we need to freeze the parameters of the convolutional layers and only update the parameters of the fully connected layers by using the backpropagation algorithm. In addition, compared with DL-based network training, far fewer samples and a shorter training time are needed in the fine-tuning phase. Similar to the pre-training phase, we utilize the same loss function (given in (

Algorithm 1: Transfer learning for channel estimation
Input: The source tasks $\{T_S(k)\}_{k=1}^{K_s}$, target tasks $T_T$, pre-training learning rate $\gamma_1$, fine-tuning learning rate $\gamma_2$, batch size $V$, number of gradsteps for pre-training $G_{pre}$, and number of gradsteps for fine-tuning $G_{fin}$
Output: The pre-trained network parameter $\Theta_{pre}$, and the estimated CSI based on TL $H$.
1 Pre-training stage
2 Randomly initialize the network parameters $\Theta_{pre}$
3 Generate the training dataset $\{D_S\} \in \{D_S(k)\}$
  Randomly select $V$ training samples from $\{D_S\}$ as the training batch $\{D_{STrB}\}$
  Load the trained parameters $\Theta_{pre}$ and generate the testing dataset $\{D_T\}$
10 Predict the target channel in the target given environment based on $\Theta_{pre}$ and $\{D_T\}$ using (8)
11 Fine-tuning stage
12 Load the pre-trained network parameters $\Theta_{pre}$
13 Generate the fine-tuning dataset $\{D_T\}$, and then divide $\{D_T\}$ into $\{D_{TTr}\}$ and $\{D_{TTe}\}$
14 for $t = 1, \ldots, G_{fin}$ do
15   Load the network parameters $\Theta_{pre} \rightarrow \Theta_{fin}$
16   Randomly select $V$ training samples from $\{D_{TTr}\}$ as the training batch $\{D_{TTrB}\}$
17   Update $\Theta_{fin}$ by using the ADAM algorithm (learning rate $\gamma_2$) to minimize $L_{fin}$
18 end
19 Transfer testing stage
20 Predict the CSI of target given environment based on $\{D_{TTe}\}$ and parameters $\Theta_{fin}$.
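The fine-tuning idea, in which the convolutional parameters stay frozen and only the fully connected parameters are updated with ADAM on random batches from the target training set, can be sketched with a toy model. This is not the ReCNN: the one-kernel "convolutional" layer, the scalar output, the mean squared error loss, the hyperparameter values, and all names (`conv_features`, `fine_tune`, `theta_pre`, `theta_fin`) are assumptions made only for illustration.

```python
import math
import random


def conv_features(x, kernel):
    """Frozen feature extractor: valid 1-D convolution followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(kernel[j] * x[i + j] for j in range(k)))
            for i in range(len(x) - k + 1)]


def predict(x, params):
    feats = conv_features(x, params["conv"])
    return sum(w * f for w, f in zip(params["fc"], feats)) + params["bias"]


def fine_tune(params, data, lr=0.05, batch_size=4, steps=200, seed=0):
    """Update only the fully connected parameters with Adam; conv is frozen."""
    rng = random.Random(seed)
    theta = list(params["fc"]) + [params["bias"]]  # trainable parameters only
    m = [0.0] * len(theta)
    v = [0.0] * len(theta)
    b1, b2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        batch = rng.sample(data, batch_size)  # random training batch
        grad = [0.0] * len(theta)
        for x, target in batch:
            feats = conv_features(x, params["conv"]) + [1.0]  # 1.0 pairs with bias
            err = sum(w * f for w, f in zip(theta, feats)) - target
            for i, f in enumerate(feats):
                grad[i] += 2.0 * err * f / batch_size  # gradient of the batch MSE
        for i, g in enumerate(grad):
            m[i] = b1 * m[i] + (1 - b1) * g
            v[i] = b2 * v[i] + (1 - b2) * g * g
            m_hat = m[i] / (1 - b1 ** t)
            v_hat = v[i] / (1 - b2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return {"conv": list(params["conv"]), "fc": theta[:-1], "bias": theta[-1]}


def mse(params, data):
    return sum((predict(x, params) - y) ** 2 for x, y in data) / len(data)


# Hypothetical values standing in for the pre-trained parameters.
theta_pre = {"conv": [0.5, -0.25, 0.5], "fc": [0.1, 0.1, 0.1], "bias": 0.0}

# Small target-domain dataset: same features, different output mapping.
rng = random.Random(1)
target_head = {"conv": theta_pre["conv"], "fc": [0.8, -0.3, 0.5], "bias": 0.2}
inputs = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(40)]
data = [(x, predict(x, target_head)) for x in inputs]
train, test = data[:30], data[30:]

theta_fin = fine_tune(theta_pre, train)
print(mse(theta_pre, test), mse(theta_fin, test))
```

Because only the last-layer parameters are trainable, each step touches far fewer parameters than full retraining, which is consistent with the smaller sample and training-time requirements noted for the fine-tuning phase.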