Distinguishing ChatGPT(-3.5, -4)-generated and human-written papers through Japanese stylometric analysis

  • Wataru Zaitsu

    Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Visualization, Writing – original draft, Writing – review & editing

    wataru0112csi@yahoo.co.jp

    Affiliation: Department of Psychological Counselling, Faculty of Psychology, Mejiro University, Tokyo, Japan

  • Mingzhe Jin

    Roles: Formal analysis, Funding acquisition, Resources, Software, Supervision

    Affiliation: Institute of Interdisciplinary Research, Kyoto University of Advanced Science, Kyoto, Japan

Abstract

In the first half of 2023, text-generative artificial intelligence (AI), including ChatGPT from OpenAI, attracted considerable attention worldwide. In this study, we first compared the Japanese stylometric features of texts generated by ChatGPT, equipped with GPT-3.5 and GPT-4, with those of texts written by humans. We performed multi-dimensional scaling (MDS) to examine the distributions of 216 texts from three classes (72 academic papers written by 36 single authors, 72 texts generated by GPT-3.5, and 72 texts generated by GPT-4 on the basis of the titles of the aforementioned papers), focusing on the following stylometric features: (1) bigrams of parts-of-speech, (2) bigrams of postpositional particle words, (3) positioning of commas, and (4) rate of function words. MDS revealed distinct distributions for GPT (-3.5 and -4) and human texts in each stylometric feature. Although GPT-4 is more powerful than GPT-3.5 because it has more parameters, the GPT-3.5 and GPT-4 distributions overlapped. These results indicate that, even if the number of parameters increases in the future, GPT-generated texts may not approach human-written texts in terms of stylometric features. Second, we verified the classification performance of a random forest (RF) classifier for two classes (GPT and human) based on the Japanese stylometric features. The RF classifier achieved high performance for each stylometric feature: focusing on the rate of function words alone, it reached 98.1% accuracy. Furthermore, the RF classifier using all stylometric features reached 100% on all performance indexes (accuracy, recall, precision, and F1 score). This study concludes that, at this stage, ChatGPT-generated and human-written texts can be distinguished, at least for the Japanese language.

Introduction

ChatGPT [1], released by OpenAI on November 30, 2022, can fluently generate human-like writing, such as essays, articles, academic papers, and news, and can summarize sentences in English, Spanish, French, and Japanese, among other languages. Based on generative pre-trained transformer (GPT) technology, this generative AI is a type of natural language processing (NLP) model known as a large language model (LLM). GPT-3.5 is an early version of ChatGPT [1]. ChatGPT has attracted considerable attention worldwide, reaching 100 million active users within two months of its release. Other chatbot systems, such as “Bard” by Google and “Bing AI” by Microsoft, have been released after ChatGPT. Although such chatbot systems provide numerous benefits, they also raise problems. These chatbots can easily generate large amounts of fake news and exaggerated facts and spread them globally. In the educational sector, university students may use chatbots to write their reports, and teachers may not detect this deception. Moreover, fake scientific papers, generated by chatbots without any underlying research or experiments, could cause various problems (e.g., plagiarism or academic fraud) in the scientific field. Most people cannot distinguish texts generated by AI from those written by humans [2, 3]. Therefore, on January 31, 2023, OpenAI released an AI text classifier [4] to detect AI-written text. However, according to OpenAI [4], the detection accuracy of this classifier is low even for English texts. Although its specificity was acceptable (91% of human-written texts were correctly classified as “human-written”), its sensitivity (the true positive rate, i.e., the rate at which AI-generated text is identified as “written by AI”) reached only 26%, meaning that 74% of AI-generated texts were incorrectly classified as “human-written.” Furthermore, the performance of this classifier was even lower for non-English languages. Therefore, verification for other languages, including Japanese, is necessary.

Moreover, OpenAI released GPT-4 on March 14, 2023. Compared with GPT-3.5, this version achieves superior performance in terms of reasoning and conciseness. According to the literature [5], GPT-3.5 has 175 billion parameters, whereas GPT-4 is expected to have approximately 100 trillion parameters. Therefore, Study 1 investigated the following: (1) whether differences in stylometric features exist between ChatGPT-generated and human-written texts (if such differences exist, the two types of texts can be classified); and (2) whether the stylometric features of GPT-4-generated texts differ from those of the earlier version (GPT-3.5) and are more similar to those of human-written texts. Regarding (1), recent studies [6, 7] on GPT-3.5 reported differences in parts-of-speech between AI-produced and human-produced English and Chinese texts. However, research on the Japanese language is limited. Therefore, Study 1 compared three types of texts (GPT-3.5-generated, GPT-4-generated, and human-written) by focusing on several Japanese stylometric features (parts-of-speech, etc.). Study 2 verified the classification performance between ChatGPT and humans based on several Japanese stylometric features using random forest (RF). Recent studies [6–10] have demonstrated that AI classifiers can detect AI-generated texts with high performance, at approximately greater than 95%. Theocharopoulos et al. [9] compared the classification performance of several classifiers, including classical machine learning (ML) methods such as the logistic regression model, multinomial naive Bayes, and support vector machine; the best-performing deep learning methods (Bidirectional Encoder Representations from Transformers (BERT) and Long Short-Term Memory networks (LSTM)) attained an accuracy of 98.7%. On the other hand, Islam et al. [11] reported that classical ML methods (Extremely Randomized Trees and RF) outperformed a neural classifier (LSTM). In studies on authorship identification [12], Japanese stylometric analysis using RF achieved the best scores among several machine learning techniques (support vector machine, bagging, and boosting). Furthermore, the results of neural classifiers are incomprehensible (i.e., a black box), whereas the RF classifier can calculate an understandable “importance” for each variable (i.e., a “white-box” type of machine learning) [13]. Therefore, we used RF as the classifier in this task.

Our study is novel because, to date, no research has examined the differences in stylometric features between AI-generated and human-written texts, or verified the corresponding classification performance, for the Japanese language.

Method

Sample

First, we gathered 72 Japanese academic papers on psychology (from “The Japanese Journal of Psychology” published by The Japanese Psychological Association, “The Japanese Journal of Criminal Psychology” published by The Japanese Association of Criminal Psychology, and “The Japanese Journal of Social Psychology” published by The Japanese Society of Social Psychology), written by 36 single authors. To control text length, texts of approximately 1,000 characters were generated by randomly extracting sentences from each paper, excluding citations. In authorship identification using stylometric analysis, longer texts enable more accurate identification of authors; however, approximately 1,000 characters has been established as the minimum length for valid identification [14, 15].

Second, we generated 144 texts (72 texts with ChatGPT-3.5 and 72 texts with ChatGPT-4) of approximately 1,000 characters in Japanese. To control the topic of each text, we instructed ChatGPT to generate Japanese papers with the same titles as the academic papers described above: “Write the ‘Introduction (or Method, Results, Conclusion)’ of a paper on ‘the title of the paper’.”

Stylometric features

We transformed the above Japanese text samples into four datasets corresponding to the following four stylometric features: (1) bigrams of parts-of-speech, (2) bigrams of postpositional particle words, (3) positioning of commas, and (4) rate of function words. These stylometric features are effective for identifying authors and do not depend on topics. A bigram is an N-gram with N = 2, obtained by calculating the frequency of adjacent symbols (words, phrases, or characters) in each text. For the bigrams of parts-of-speech, we used the Japanese POS tagger ChaSen [16] and counted the occurrences of combinations such as “preposition + verb,” “verb + adjective,” and “noun + adjective” in each Japanese text. ChaSen produces fine-grained Japanese part-of-speech tags; for example, postpositional particles are subdivided into “case particle,” “binding particle,” “ending particle,” and so on. The number of combinations for the bigrams of parts-of-speech was 955 (i.e., 955 variables in the dataset). Jin [17] demonstrated the effectiveness of bigrams of parts-of-speech in identifying Japanese authors. For the bigrams of postpositional particle words, we counted the frequency of adjacent postpositional particle words, such as “binding particle (は) + case particle (を)”, “case particle (を) + case particle (が)”, and “binding particle (は) + case particle (の)”, yielding 533 variables. Third, the positioning of commas [18] is used to identify authorship, for example, “は + , (comma)”, “が + , (comma)”, and “を + , (comma)” (48 variables). Finally, the rate of function words denotes the percentage of words that carry grammatical rather than lexical meaning, excluding nouns, verbs, and adjectives, such as auxiliary verbs, conjunctions, and postpositional particles in each text (e.g., “ある / auxiliary verb”, “さらに / conjunction”, and “に / postpositional particle”; 221 variables). This study included adverbs among the function words.

Zaitsu & Jin [14] reported the validity of these four stylometric features for identifying Japanese authors by analyzing texts of approximately 1,000 characters: the most effective feature was the rate of function words, followed by the bigrams of parts-of-speech. That study reported high classification performance: 100% sensitivity and 95.1% specificity. Additionally, in this study, we merged the four datasets into a single integrated dataset and analyzed it to verify the incremental validity of using all stylometric features.
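As a minimal illustration of how one such feature vector can be built, the following R sketch computes relative part-of-speech bigram frequencies from a sequence of POS tags; the function name and the example tags are illustrative, and the actual tagging in this study was performed with ChaSen.

pos_bigram_freq <- function(tags) {
  # tags: character vector of POS tags in token order,
  # e.g., as produced by a morphological analyzer such as ChaSen
  stopifnot(length(tags) >= 2)
  bigrams <- paste(head(tags, -1), tail(tags, -1), sep = " + ")
  freq <- table(bigrams)
  freq / sum(freq)  # relative frequencies, comparable across texts of different lengths
}

pos_bigram_freq(c("noun", "case particle", "verb", "auxiliary verb"))
# "case particle + verb", "noun + case particle", "verb + auxiliary verb": one third each

Each text then contributes one row to the dataset, with one column per observed bigram (955 columns in the case of the parts-of-speech bigrams).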

Analysis procedure

In Study 1, we analyzed the 216 texts of the three classes (GPT-3.5, GPT-4, and Human) in the five datasets using classical MDS to examine (1) whether the distributions of both AI classes differ from those of human-written texts, and (2) whether the distributions of GPT-4-generated texts differ from those of GPT-3.5 and are closer to the human distributions than those of GPT-3.5. MDS arranges objects in a low-dimensional space based on their pairwise similarities, expressed as distances; the resulting configuration therefore depends on the choice of distance measure. In this study, we used the symmetric Jensen-Shannon divergence distance ($d_{SJSD}$), defined for two relative frequency distributions $P = (p_1, \ldots, p_k)$ and $Q = (q_1, \ldots, q_k)$ as follows [19]:

$d_{SJSD}(P, Q) = \frac{1}{2} \sum_{i=1}^{k} p_i \log \frac{2 p_i}{p_i + q_i} + \frac{1}{2} \sum_{i=1}^{k} q_i \log \frac{2 q_i}{p_i + q_i}$
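This step can be sketched in R as follows, assuming each text has already been converted into a row of relative feature frequencies; the matrix name freq and the factor class_labels are illustrative, not from the original analysis.

# freq: numeric matrix of relative frequencies (rows = texts, columns = features),
# each row summing to 1; class_labels: factor with levels GPT-3.5, GPT-4, Human
jsd <- function(p, q) {
  m <- (p + q) / 2
  kl <- function(a, b) {  # Kullback-Leibler divergence, with 0 * log(0) treated as 0
    nz <- a > 0
    sum(a[nz] * log(a[nz] / b[nz]))
  }
  (kl(p, m) + kl(q, m)) / 2  # symmetric Jensen-Shannon divergence
}

n <- nrow(freq)
d <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- jsd(freq[i, ], freq[j, ])
  }
}
coords <- cmdscale(as.dist(d), k = 2)  # classical MDS (stats package), two dimensions
plot(coords, col = class_labels, xlab = "Dimension 1", ylab = "Dimension 2")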

In Study 2, we conducted leave-one-out cross-validation (LOOCV) using RF to verify the classification accuracy for the 216 texts. In LOOCV, we constructed an RF classifier from 215 texts (the training set), holding out one text (the testing set). The classifier then assigned the held-out text to one of the two classes (AI-generated or human-written). This procedure was repeated until every text had served once as the testing set.

The R language was used for MDS (the cmdscale function of the stats package) and RF (the randomForest function of the randomForest package). For RF, the number of decision trees was set to 1,000, and the other hyperparameters were set to their default values.
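Under these settings, the LOOCV procedure of Study 2 can be sketched in R as follows; the data frame x (stylometric feature values, one row per text) and the factor y (class labels) are illustrative names.

library(randomForest)

n <- nrow(x)
pred <- character(n)
for (i in seq_len(n)) {
  # train on all texts except text i, then classify text i
  rf <- randomForest(x[-i, ], y[-i], ntree = 1000)  # other hyperparameters at defaults
  pred[i] <- as.character(predict(rf, x[i, , drop = FALSE]))
}
pred <- factor(pred, levels = levels(y))
table(predicted = pred, actual = y)  # confusion matrix
mean(pred == y)                      # LOOCV accuracy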

Results

Study 1: Distributions of AI (GPT-3.5 and -4)-generated and human-written texts in each stylometric feature

Fig 1 displays the distribution of texts with respect to the bigrams of parts-of-speech. The distributions of the three classes were partially mixed, but AI-generated and human-written texts were generally separated from each other. Figs 2 (bigrams of postpositional particle words) and 3 (positioning of commas) show similar results. These results indicate that some human-written texts are indistinguishable from GPT-generated texts. In Fig 4, concerning the rate of function words, only a few mixed texts can be confirmed. Fig 5, in which all stylometric features are used, reveals complete separation of the AI-generated and human-written texts. Figs 1–5 indicate that the GPT-4 texts generally overlapped the GPT-3.5 texts, and the GPT-4 distributions were no closer to the human distributions than those of GPT-3.5.

Fig 1. Configuration of AI (GPT-3.5 and GPT-4)-generated and human-written texts, analyzed by MDS with dSJSD, focusing on the bigrams of parts of speech.

https://doi.org/10.1371/journal.pone.0288453.g001

Fig 2. Configuration of AI (GPT-3.5 and GPT-4)-generated and human-written texts, analyzed by MDS with dSJSD, focusing on the bigram of postpositional particle words.

https://doi.org/10.1371/journal.pone.0288453.g002

Fig 3. Configuration of AI (GPT-3.5 and GPT-4)-generated and human-written texts, analyzed by MDS with dSJSD, focusing on the positioning of commas.

https://doi.org/10.1371/journal.pone.0288453.g003

Fig 4. Configuration of AI (GPT-3.5 and GPT-4)-generated and human-written texts, analyzed by MDS with dSJSD, focusing on the rate of function words.

https://doi.org/10.1371/journal.pone.0288453.g004

Fig 5. Configuration of AI (GPT-3.5 and GPT-4)-generated and human-written texts, analyzed by MDS with dSJSD, focusing on all stylometric features.

https://doi.org/10.1371/journal.pone.0288453.g005

Study 2: Performance levels of classifying texts into two classes (AI-generated and human-written) by RF

We determined the confusion matrices (Table 1) corresponding to the four individual stylometric features and to all features combined, and calculated four performance indexes: “Accuracy” (the ratio of correct classifications among all texts), “Recall” (the ratio of correctly classified texts among all texts in each true class), “Precision” (the ratio of correctly classified texts among all texts assigned to each class), and “F1 score” (the harmonic mean of recall and precision). The details of the confusion matrix and the equations for the above performance metrics are described below.
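Written in terms of a two-class confusion matrix with true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), and taking one class (e.g., AI-generated) as positive, the standard definitions are:

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}$

$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$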

Tables 2–6 indicate that the rate of function words showed the highest accuracy. By contrast, the bigrams of postpositional particle words (Table 3) exhibited the lowest performance among the four stylometric features, although even this feature reached 100% in either recall or precision for one of the classes.

Table 2. Confusion matrices and performance indexes for the bigrams of parts-of-speech.

https://doi.org/10.1371/journal.pone.0288453.t002

Table 3. Confusion matrices and performance indexes for the bigrams of postpositional particle words.

https://doi.org/10.1371/journal.pone.0288453.t003

Table 4. Confusion matrices and performance indexes for the positioning of commas.

https://doi.org/10.1371/journal.pone.0288453.t004

Table 5. Confusion matrices and performance indexes for the rate of function words.

https://doi.org/10.1371/journal.pone.0288453.t005

Table 6. Confusion matrices and performance indexes for all stylometric features.

https://doi.org/10.1371/journal.pone.0288453.t006

Additionally, we conducted 10-fold cross-validation for reference. The mean accuracies for each stylometric feature were as follows: (1) bigrams of parts-of-speech: 90.3% (SD = 18.1%); (2) bigrams of postpositional particle words: 83.4% (SD = 24.9%); (3) positioning of commas: 91.2% (SD = 13.2%); (4) rate of function words: 94.9% (SD = 11.1%); and (5) all stylometric features: 96.3% (SD = 7.4%).
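A minimal R sketch of this 10-fold procedure, continuing with the illustrative x and y from the LOOCV sketch above:

set.seed(1)  # illustrative seed; the seed used in the study is not reported
folds <- sample(rep(1:10, length.out = nrow(x)))  # random assignment of texts to folds
acc <- sapply(1:10, function(k) {
  rf <- randomForest(x[folds != k, ], y[folds != k], ntree = 1000)
  mean(predict(rf, x[folds == k, ]) == y[folds == k])  # accuracy on the held-out fold
})
c(mean = mean(acc), sd = sd(acc))  # mean accuracy and SD across the 10 folds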

The RF classifier can also identify which stylometric features are effective for distinguishing GPT-generated from human-written texts through the “importance” of its variables. The most important variables were “noun + postpositional particle” among the bigrams of parts-of-speech, “case particle (が) + case particle (が)” among the bigrams of postpositional particle words, “は + , (comma)” in the positioning of commas, and “本 / prefix” in the rate of function words.
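In R, such importance values can be extracted as sketched below; randomForest reports the mean decrease in Gini impurity (and, with importance = TRUE, a permutation-based measure as well), and since the study does not state which measure it used, the choice here is an assumption.

rf <- randomForest(x, y, ntree = 1000, importance = TRUE)
imp <- importance(rf, type = 2)  # type = 2: mean decrease in Gini impurity
head(sort(imp[, "MeanDecreaseGini"], decreasing = TRUE), 10)  # ten most important features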

Discussion

According to the MDS results in Study 1, we first found gaps between AI (GPT-3.5 and -4)-generated and human-written papers in each stylometric feature, particularly in the rate of function words. These results indicate that AI and humans have distinct writing processes and mechanisms. Furthermore, the distributions of the two AI versions (GPT-3.5 and -4) showed considerable overlap, despite the considerable difference in their numbers of parameters, and the GPT-4 distributions were no closer to the human distributions than those of GPT-3.5. Thus, even if the number of parameters increases in the future, the distribution of ChatGPT-generated texts may not approach that of human-written texts in any of these stylometric features. Similar research approaches also need to be applied to other chatbots. Second, in Study 2, we obtained high classification performance with the RF classifier based on several Japanese stylometric features. We could also verify incremental validity, because the RF classifier using all stylometric features achieved higher performance (100% accuracy) than the classifier for any single stylometric feature. Summarizing these results, the farther apart the distributions of AI and human texts are in MDS, the better the classification performance achieved by RF.

The current study has several limitations. First, we considered only the Japanese language. According to the MMLU (massive multitask language understanding) benchmark [20], GPT-4 achieved its best performance for English, whereas its performance for Japanese ranked 16th. The Japanese language is distinctive because of the coexistence of several scripts (Kanji, Hiragana, and Katakana) within sentences and the absence of spaces between words, unlike English. Therefore, an analysis of English may yield results distinct from those of the current study. Second, academic papers exhibit low flexibility because the writing must conform to the rules of each academic association. Future studies should use materials with higher degrees of freedom, such as diaries, blogs, and Twitter posts. Third, several chatbots other than ChatGPT exist (“Bard” by Google and “Bing AI” by Microsoft), and different AI systems may generate texts with different distributions of stylometric features. For this reason, texts generated by other AI systems should be studied using the same methodology as the current study (i.e., stylometric analysis). Finally, texts of approximately 1,000 characters were analyzed to control text length. For texts containing more than 1,000 characters, higher performance than in the current study may be obtained; conversely, performance may decrease for shorter texts.

More sophisticated text-generative AI will appear in the future. Although predicting the future is difficult, we may need to study AI in order to control generative AI.

Conclusion

The conclusions of this study are as follows: (1) The Japanese stylometric features of texts generated by ChatGPT (GPT-3.5 and -4) were distinct from those of texts written by humans. (2) The distributions of GPT-4-generated texts overlapped those of GPT-3.5 in the stylometric features and were no closer to the human distributions. (3) At present, AI-generated and human-written texts can be classified in the case of Japanese, with performance levels (accuracy, recall, precision, and F1 score) of 100%.

References

  1. OpenAI [Internet]. Introducing ChatGPT; c2022 [cited 2023 May 31]. Available from: https://openai.com/blog/chatgpt.
  2. Köbis N, Mossink LD. Artificial intelligence versus Maya Angelou: Experimental evidence that people cannot differentiate AI-generated from human-written poetry. Comput. Hum. Behav. 2021;114:106553.
  3. Clark E, August T, Serrano S, Haduong N, Gururangan S, Smith NA. All that’s ‘Human’ is not gold: Evaluating human evaluation of generated text. Proc. 59th Annu. Meeting Assoc. Comput. Ling. 11th Int. Joint Conf. Nat. Lang. Processing. 2021;1:7282–7296.
  4. OpenAI [Internet]. New AI classifier for indicating AI-written text; c2023 [cited 2023 May 31]. Available from: https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text.
  5. Rudolph J, Tan S, Tan S. ChatGPT: Bullshit spewer or the end of traditional assessments in higher education? J. Appl. Learn. Teach. 2023;6(1):1–22.
  6. Guo B, Zhang X, Wang Z, Jiang M, Nie J, Ding Y, et al. How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint. 2023;arXiv:2301.07597.
  7. Liao W, Liu Z, Dai H, Xu S, Wu Z, Zhang Y, et al. Differentiate ChatGPT-generated and human-written medical texts. arXiv preprint. 2023;arXiv:2304.11567.
  8. Ma Y, Liu J, Yi F, Cheng Q, Huang Y, Lu W, et al. AI vs. human: Differentiation analysis of scientific content generation. arXiv preprint. 2023;arXiv:2301.10416.
  9. Theocharopoulos PC, Anagnostou P, Tsoukala A, Georgakopoulos SV, Tasoulis SK, Plagianakos VP. Detection of fake generated scientific abstracts. arXiv preprint. 2023;arXiv:2304.06148.
  10. Mitrović S, Andreoletti D, Ayoub O. ChatGPT or human? Detect and explain: Explaining decisions of machine learning model for detecting short ChatGPT-generated text. arXiv preprint. 2023;arXiv:2301.13852.
  11. Islam N, Sutradhar D, Noor H, Raya JT, Maisha MT, Farid DM. Distinguishing human generated text from ChatGPT generated text using machine learning. arXiv preprint. 2023;arXiv:2306.01761.
  12. Jin M, Murakami M. Randamu foresutohou ni yoru bunshou no kakite no doutei [Authorship identification using random forests]. Proc. Inst. Stat. Math. 2007;55(2):255–268.
  13. Genuer R, Poggi JM. Random Forests with R. Cham: Springer; 2020.
  14. Zaitsu W, Jin M. Tekisuto mainingu niyoru hisshashikibetsu no seikakusei narabini hantei tetsuduki no hyoujunka [Accuracy and standardized judgment procedures for author identification]. Jpn. J. Behaviormetrics. 2018;45(1):39–47.
  15. Zaitsu W, Jin M. Tekisuto mainingu wo mochiita hisshashikibetsu eno sukoaringu dounyu: Mojisu ya tekisutosuu, buntaitekitokucho ga tokutenbunpu ni oyobosu eikyo [Introduction of scoring for author identification by text mining: Effects of the number of characters and texts, and the features of writing style]. Jpn. J. Forensic Sci. Technol. 2017;22(2):91–108.
  16. Asahara M, Matsumoto Y. ipadic version 2.7.0 User’s Manual. Nara Institute of Science and Technology. 2003;1:42. Available from: https://ja.osdn.net/projects/ipadic/docs/ipadic-2.7.0-manual-en.pdf/en/1/ipadic-2.7.0-manual-en.pdf.pdf
  17. Jin M. Hinshi no marukohu sen’i no joho wo motiita kakite no dotei [Identifying writers using information on Markov transitions of parts-of-speech]. Proc. 32nd Annu. Meeting Behaviormetric Soc. Jpn. 2004;384–385.
  18. Jin M, Murakami M. Authors’ characteristic writing styles as seen through their use of commas. Behaviormetrika. 1993;20(1):63–76.
  19. Jin M, Jiang M. Text clustering on authorship attribution based on features of punctuation usage in Chinese. Inf. 2013;16(7):4983–4990.
  20. OpenAI [Internet]. GPT-4 Technical Report; c2023 [cited 2023 May 31]. Available from: https://cdn.openai.com/papers/gpt-4.pdf.