Investigation of Cross-Contamination and Misidentification of 278 Widely Used Tumor Cell Lines

In recent years, biological research involving human cell lines has been rapidly developing in China. However, some of the cell lines are not authenticated before use. Therefore, misidentified and/or cross-contaminated cell lines are unfortunately commonplace. In this study, we present a comprehensive investigation of cross-contamination and misidentification for a panel of 278 cell lines from 28 institutes in China by using short tandem repeat profiling method. By comparing the DNA profiles with the cell bank databases of ATCC and DSMZ, a total of 46.0% (128/278) cases with cross-contamination/misidentification were uncovered coming from 22 institutes. Notably, 73.2% (52 out of 71) of the cell lines established by the Chinese researchers were misidentified and accounted for 40.6% of total misidentification (52/128). Further, 67.3% (35/52) of the misidentified cell lines established in laboratories of China were HeLa cells or a possible hybrid of HeLa with another kind of cell line. Furthermore, the bile duct cancer cell line HCCC-9810 and degenerative lung cancer Calu-6 exhibited 88.9% match in the ATCC database (9-loci), indicating that they were from the same origin. However, when we used 21-loci to compare these two cell lines with the same algorithm, the percent match was only 48.2%, indicating that these two cell lines were different. The SNP profiles of HCCC-9810 and Calu-6 also revealed that they were different cell lines. 150 cell lines with unique profiles demonstrated a wide range of in vitro phenotypes. This panel of 150 genomically validated cancer cell lines represents a valuable resource for the cancer research community and will advance our understanding of the disease by providing a standard reference for cell lines that can be used for biological as well as preclinical studies.


Introduction
Although cell line authentication has been widely recommended for many years, misidentification, including cross-contamination, remains an unresolved issue [1,2]. With the development of biomedical research, more and more scientific reports involving human cell lines have been published by researchers in China. However, authentication has been neglected by many researchers [3]. Until recently, many major journals and research agencies recommended that a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 cell lines should be verified for authenticity before publication or inclusion in grant applications. Although there are many new methods for cell line authentication [4,5], DNA profiling based on short tandem repeat (STR) is still proposed as a gold standard method. Our lab started using STR for cell line authentication from 2009 with 16/20-loci STR and found about 25% cross-contamination or misidentification among 380 cell lines from 2009 to 2013 [6]. In this study, we provided 21-loci STR for 278 cancer cell lines (data collected from 2014 to 2016), in accordance with the major cell repository' recommendation for the testing of cell line identity. We observed examples of cross-contamination and misidentification of cell lines, and provided reference STR profiles for cancer cell lines that are currently unavailable in the STR database of major cell repositories. Additionally, we have extended the number of genetic loci for STR profiles currently available in public databases.

Results
Incidence and extent of cell line cross-contaminants DNA from 278 cell lines was analyzed by STR profiling (S2 Table). By performing sensitivity evaluation assay for cross-contamination of intraspecies in the cell lines, we found that over 5% of the cell lines presented cross-contamination of intraspecies (Table 1). 20 samples (20/ 278 = 7.2%) that failed to be amplified by PCR were confirmed to be non-human origin. The success rate for complete analysis was 258/278 (92.8%). 150 cell lines were found to be unique (containing only one type of cells). The verified cell lines were derived from 20 different tissues (Table 2). Altogether, 128 instances of cross-contamination were confirmed, with an incidence of 46.0% (128/278). 84.4% (108/128) of cross-contamination was intra-species while 20 samples failed to be amplified by PCR and confirmed to be inter-species cross-contamination. We chose two of non-human cells to confirm the origin by species identification analysis [7], and confirmed that HIBEC cells were originally from rat and C201441 cells were from mouse (see S1 Fig). There were 14 cell lines for which we could not find any original documents about the cell line background, including C2013103, C201441, C201439, KI-H6, K324, A533, C201535, CB-06, D739, BC-034, CHMAS, PC-9, Anglne and CCLP-1. According to the original description, we divided the remaining cell lines (264 cell lines) into two categories: the Chinese model (cell lines established in China, n = 71) and the non-Chinese model (cell lines established outside China, n = 193) (Fig 1).
The authenticity test of non-Chinese cell model has been performed by cell repositories and their corresponding STR profiles that are available for comparison. 129 instances out of 193 characterized samples had distinct STR profiles and consisted of 78 individual cell lines. A total of 64 instances were unequivocally misidentified or cross-contaminated (64/193, 33.2%), including four examples that were notably false cell lines (U-251, Hep2, Ej and HBL-100) for which the actual identity had been previously reported [8][9][10][11][12], contributing 6.3% (4/64) of the contamination rate. The remaining 60 contaminated cell lines, contributing an incidence of 93.7% (60/64), had authenticated prototypes deposited in cell repositories or research institutions worldwide.
Compared to the non-Chinese group, extremely high levels of contamination were found among cell lines established in China. In the Chinese model, only 19 instances out of 71 had distinct STR profiles (26.8%) and consisted of 11 individual cell lines ( Table 3). The remaining instances were unequivocally misidentified with an incidence of 73.2% (52 out of 71), affecting 24 cancer cell lines.

Most common contaminants
In most instances (89/128), the cross-contaminating intruder could be identified via comparison in the databases provided by cell banks. The remaining 39 cell lines could not be identified by source. Cross-contamination of cell lines causes mischaracterization, which has been demonstrated in several studies. HeLa [16] and its derivatives HeLa S3, AV3 [17] and WISH contaminations are notorious problems and may be simply detected by either isoenzyme [18] or STR profiling [19]. Our findings showed evidence of 46.9% (60/128) cross-contamination was caused by HeLa, affecting 31 cell lines, which were purported to have been established from 10 types of tumor (colon, liver, breast, stomach, lung, nasopharynx, tougue, laryngeal, salivary gland and bronchia) and 3 types of normal tissue (lung, intestine and liver). The bladder cancer cell line T24 was established in 1972 and has contributed to several cell line cross-contamination [20,21]. We identified three separate instances by T-24 in two laboratories, which might have used the established cell lines: LNCaP and EJ. Two multilocus profiles revealed a close relationship between LNCaP and EJ, which was a vector for T24 cell line. More details are shown in S3 Table. The prolific contaminants varied in each group. 25 instances of 64 contaminants in the non-Chinese model were HeLa, with an incidence rate of 39.1%. Contamination caused by non-human cell lines contributed a minor proportion (n = 9, 14.1%). The remaining contaminants identified in the non-Chinese model are described as follows: HCT-15 (n = 3), AGS (n = 1), U-87MG (n = 1), H226 (n = 1), SNB19 (n = 1), ECV304 (n = 1), T24 (n = 3), CCRF-CEM (n = 2), H460 (n = 1), 769-P (n = 1), Saos-2 (n = 1) and unknown (n = 14) (Fig 2). In contrast, contaminants found in the Chinese group were attributable to a few cell lines. Out of 52 identified contaminants, 35 were caused by HeLa with an incidence rate of 67.3%. Two adenoid cystic carcinoma, ACC-2 and ACC-M were reported as being contaminated by HeLa [22]. CNE-2, derived from the Chinese patients, was reported bearing HPV sequences and suspected to be contaminated by fusion of HeLa with another nasopharyngeal cell line [23]. Different samples of SGC-7901, SMMC-7721, BEL-7402, BEL-7404, BEL-7704, HL-7702, BGC-823 and TSCCA cell lines collected from independent laboratories were contaminated by either HeLa or another contaminant, possibly caused by a second intruder when they were cultured (Table 3). Contamination caused by non-human cell lines contributed 19.2%. The remaining contaminants found in the Chinese model are described as follows: MCF-7 (n = 1), AGS-1 (n = 1), RBE (n = 1), HT-29 (n = 1), A549 (n = 2), and WSS-1 (n = 1) (Fig 3).

Genetic instability in vitro
In general, 69 HeLa cell lines showed one or more alterations in STR profile compared with the prototype cell. The differences among most of the samples from different laboratories were likely attributable to different culture conditions or passage number. This suggested HeLa cell line had remarkable degree of genetic instability as defined by STR profile. In sharp contrast, the subclones generated from K562 showed a high concordance of the clonotype homogeneity, i.e., one different clonotype was observed among the six K562 cell lines and the profile of the subclones differed from the parental culture by a single change of STR. Comparison of original cell lines with the sublines derived from them revealed that nine other cell lines including 293, BxPC-3, HT-29, DLD-1, HCT-116, Jurkat, Raji, HeLa and 5637, exhibited variable STR profiles after multiple passages in culture. The subclones of these nine cell lines shared 80-94.12% identity to the original cell lines and appeared to bear null allele and mutations (Table 4). Several studies have demonstrated that Jurkat cell line was genetically unstable and the STR profile varied slightly among samples. One allele (Y) was missing in the locus amelogenin (X, Y and X were found in different subclone cell lines in all the cell banks) when compared to the published data. However, another locus D7S820 had a different allele when compared to our data. This discrepancy could be resulted from either a technical issue in PCR, in which one allele was preferentially amplified while another was below the detectable level, or a biological variation where the Jurkat cell line from one laboratory had lost one allele. In overall, all three cell lines should be considered as Jurkat.
In contrast to the loss of heterozygosity, microsatellite instability [24], characterized by the occurrence of new alleles in tumors, is a major event in certain carcinomas brought about by mismatch repair deficiency. The extra band originated from one of two alleles with the   characteristic stutter at one or more repeat unit below the main signal and the new allele differed from the assigned ancestral allele by one repeat. Microsatellite instability (MSI) was reported in leukemia [25], gastric [26], ovarian [27], breast [28], and colon cancers [29].

Identity of HCCC-9810 and Calu-6
High degree of similarity (88.9%) was observed in one pair of cell lines, HCCC-9810 and Calu-6, by using ATCC STR database (9-loci STR database). This percent match indicated that these two cell lines were from the same origin. The HCCC-9810 cell line shared 4 same loci with the Calu-6 cell line at AMEL, D5S818, D13S317 and TH01, and differed from the Calu-6 cell line at other 5 loci by using the ATCC 9-loci STR database for comparison. However, when comparing these two cell lines by using 21-loci STR, we found that they still shared the same 4 loci as found by using 9-loci STR comparison, while the remaining 17 loci were different. The alleles of D21S11, D3S1358, D2S441, Penta E, D12S391 and FGA were totally different. When the same algorithm as used in 9-loci was adopted in 21-loci, the percent match of these two cell lines was 48.2%, indicating that they were from different origin (Table 5). To confirm the relationship between HCCC-9810 and Calu-6, direct sequencing of SNP was performed ( Table 6). The SNP genotyping at 24-loci showed that there were 8 mismatches loci between HCCC-9810 and Calu-6 cell line, suggesting that they were from different origins [30].

Discussion
We presented a panoptic survey of cell line cross-contamination among human tumor cell lines using molecular genetic methods. The STR profiling technique uses fluorescence and multiplex PCR based technology to detect STR loci with overlapping size range. The specificity and polymorphism advantages make it method of choice for the authentication of human cell lines. The misidentification and cross-contamination of cell lines are deemed to be a widespread problem in research, dating back as far as 45 years ago. Our lab started using STR for cell line authentication in 2009 with 16-STR loci. An alarming rate of cross-contamination among human cell lines used in China was reported by Fang Ye and colleagues [6] who investigated 380 human cell lines in which about 25% was cross-contaminated or misidentified. This study identified authentication problems with 46.4% of the cell lines that were analyzed and all the data were collected from 2014-2016, which were different from Fang Ye's data. A total of 128 instances of cell line cross-contamination were detected among 278 original human cell lines. Our finding suggests that the relative contribution of cell line cross-contamination has been 1.3 times higher than that found in a previous study [31]. The simplest form of misidentification was mislabeling of cell line, which could be identified if cell line was profiled regularly. The detection sensitivity for cross-contamination by a small number of contaminating cell lines was 5%, which was in concordance with our report. Therefore, regular cell line authentication could be a straightforward way of catching cell line cross-contamination early.
Undue success in cell line cultures from primary material provides valid ground for suspicion of cell line cross-contamination. Normal diploid cell lines, namely CCC-HIE-2, FHC and FSH-74int, established by the Chinese Academy of Medical Sciences [32,33], respectively, had been shown to be actually derived from HeLa cells. Instructively, spontaneous immortalization is rare in human cell lines. The HaCaT [34], HUVEC [35], SV-HUC-1 [36], and hEM15A cell  lines established from immortalized human skin cutis cells, umbilical cords, ureteral epithelium and endometrial stromal cells were totally free from cross-contamination. Close to 67% of the misidentified cell lines established in China were contaminated by the HeLa cell line. In addition to their use as research tools, false or misidentified cell lines such as HL-7702, TSCCA and BEL-7402 have been used in anti-cancer drug screening programs in respected centers in China. WISH, which has been reported as a HeLa contaminant for decades [37], was also on the panel of cell models for differentiation [38] in these institutions under its false identity-amnion cell line of human origin. Despite sporadic reports about cell line cross-contamination provided by cell repositories in China through systematic quality control check of their own cell lines, the tainted cell lines are still on sale for use. It is hard to estimate how much research work is developed on false premise.
The grave problem of cell line misidentification in China has been neglected. Lack of information has prevented researchers from keeping up to date with literature reports of cell line misidentification. Although the international cell repositories have compiled a list of crosscontaminated cell lines to alert their users, the grave problem continues to fester due to the negligence of cell line users and originators as well as journal editors and funding agencies. Making the Chinese researchers aware of the situation is the most essential step. For those scientists working with cell lines, basic authentication test such as STR profiling, isoenzyme analysis, cytochrome C oxidase subunit I (COI) barcoding and mycoplasma contamination tests should be carried out and transferring cell lines to colleagues should be avoided.
STR-based profiling methods have been used for the authentication of cell lines for more than 24 years. However, there are variations in the numbers of loci and in the specific loci comprising the published profiling. We observed that HCCC-9810 and Calu-6 did not reliably discriminate between same or different origin within 9 markers. Obviously, authentication was greatly improved when we employed the 21-loci STR profiling and SNP genotyping method. Over the last years, several STR profiling assays have been published for developmental validation of new STR system. Compared to the current STR profiling, which included larger number of strategically selected markers for the human population, the discrimination power of 9 STR markers is satisfactory. Obviously, authentication was much more improved when we employed the 21-loci STR profiling and SNP genotyping method. Unfortunately, currant STR databases such as ATCC, DSMZ and JCRB only provide 9-loci STR profiles. CCTCC (www.CCTCC.org) appealed to the scientific community to bring attention to the importance of cell line authentication. In 2016, a group of cell biologists led by CCTCC developed a program for cell line authentication. CCTCC repository generated a new 21-loci STR database (www.CCTCC.org/STR.html)). It was designed to represent a unique reference for validated molecular authentication of human cell lines, using the 21-marker commercial kit. End users could compare their STR profiles of cell lines from the website. The schema of the database included data tables, which consisted of a general table, where information on STR loci value could be entered, and match criteria have been set to 50-100% at 5% interval in accordance with ATCC, DSMZ and CCTCC standards. We hope that the verified cell lines available in CCTCC STR verification-database can strengthen the reproducibility and comparability of cell lines in different laboratories.
In conclusion, we authenticated a panel of 278 cancer cell lines. A high rate of cross-contamination (46.4%) emphasized the need for careful and frequent authentication of cell lines, possibly by more than one method. STR-based profiling method has been a great benefit to the scientific community, which has allowed cell line-based studies to be carried out without false premise resulted from misidentified or contaminated cell lines. To prevent exacerbation of the problem and reduce the frequency of cell-line misidentification and cross-contamination, the whole scientific community needs to get involved.

Cell lines
The authenticity of 278 tumor cell lines deposited in CCTCC by the 28 institutes was examined. The individual cell lines were supplied with duplicate, triplicate and quadruplicate collections from an independent source. The cell lines were grown in respective media at 37˚C in a humidified incubator filled with 5% CO 2 .
DNA extraction and STR profiling DNA was isolated from cells using a genomic DNA purification kit (Tiangen Biotech, Beijing, China). Mutiplex PCR reactions were performed on a T-Gradient thermal cycler (Biometra, Gottingen, Germany). Briefly, DNA from each cell line was amplified using the Microreader TM21D System (Suzhou Microread Genetics, Beijing, China) according to the manufacturer's instructions. The TM21D system uses primers to co-amplify 20 STR loci (including 13 combined core STR loci of DNA index system, namely CSF1PO, D3S1358, D5S818, D7S820, D8S1179, D13S317, D16S539, D18S51, D21S11, FGA, TH01, TPOX, vWA, and 7 other loci, namely Penta D, Penta E, D19S433, D16S1043, D2S441, D12S391, D2S1338) and amelogenin. The reaction mixture contained 2 ng genomic DNA, 10× PCR buffer, 2.5 mM dNTPs, 0.2 μM of each primer, and 2.5 U of Taq DNA polymerase. The PCR protocol was as follows: denaturing at 94˚C for 5 min, denaturing at 94˚C for 30 s, annealing at 60˚C for 60 s, extension at 70˚C for 60 s in 30 cycles and final extension at 60˚C for 30 min. One μl aliquot of amplicon was mixed with 7.5 μl HiDiformamide, (Applied Biosystems, CA, USA) and 0.5 μl ROX-500 internal-lane size standards. The mixture was denatured at 95˚C for 5 min and placed on ice immediately for 5 min, finally added 1 μl allelic ladder (Suzhou Microread Genetics, Beijing, China) and loaded on a POP-7 polymer gel (Applied Biosystems, CA, USA) for electrophoresis. The samples were electrophoresed with the GeneScan program on the ABI 3130 Genetic Analyzer (Applied Biosystem, CA, USA). The amplicon sizes were determined using Gene-Mapper 3.2 (Applied Biosystems, CA, USA). Alleles were designated by comparison to the allelic ladder. Each sample was repeated at least twice to confirm the results. If the sample could not be amplified in three time repeats, it will be considered as non-human origin. For data comparison, well-characterized and validated reference data obtained from ATCC and DSMZ were used. The algorithm of ATCC STR database was used for STR profile comparison. Percent match = Number of shared alleles between query sample and database profile/Total number of alleles in the database profile. Cell lines with ! 80% match are considered to be related, derived from a common ancestry. Cell lines with < 55% match are considered to be not related. Cell lines with 55%~80% match require further authentication of relatedness.

Sensitivity evaluation of intraspecies cross-contamination
To test the sensitivity of detecting cross-contamination, AsPC-1:HeLa cell mixtures at fixed ratios of 99:1, 95:5, 90:10 and 80:20 were generated. DNA was extracted from the cell mixtures using the genomic DNA purification kit. DNA concentrations were determined by a Nano-Drop 8000 spectrophotometer. Contaminated samples were created by mixing each 50 ng/μl sample into the other to create 20%, 15%, 10%, 5% and 1% contamination by volume.

SNP genotyping
HCCC-9810 and Calu-6 cell lines were obtained from the cell bank at the Chinese Academy of Science. The genetic identification of HCCC-9810 and Calu-6 cell lines was further characterized by SNP genotyping using direct SNP sequencing and standard PCR method. The 24 selected tagging SNPs were rs5742909, rs231775, rs1800682, rs2234767, rs1800896, rs1800872, rs1805010, rs1805015, rs1801275, rs1554606, rs1800797, rs1041981 and rs7208422 in immune response genes, rs9344, rs1801270, rs1042522 in cell cycle genes, and rs8177374, rs5743305, rs3775291, rs3788935, rs3761623, rs5743836, rs352139 and rs352140 in innate response genes. Genotyping was performed on ABI Prism 3100 Genetic Analyzer in conjunction with Clustal software. Details of PCR primer sequences for SNP genotyping were listed in S1 Table. Supporting Information