“When they say weed causes depression, but it’s your fav antidepressant”: Knowledge-aware attention framework for relationship extraction

With the increasing legalization of medical and recreational use of cannabis, more research is needed to understand the association between depression and consumer behavior related to cannabis consumption. Big social media data has the potential to provide public health analysts with deeper insights into these associations. In this interdisciplinary study, we demonstrate the value of incorporating domain-specific knowledge in the learning process to identify the relationships between cannabis use and depression. We develop an end-to-end knowledge-infused deep learning framework (Gated-K-BERT) that leverages the pre-trained BERT language representation model and a domain-specific declarative knowledge source (Drug Abuse Ontology) to jointly extract entities and their relationship using a gated fusion sharing mechanism. Our model is further tailored to focus on the entities mentioned in the sentence through an entity-position aware attention layer, where the ontology is used to locate the positions of the target entities. Experimental results show that inclusion of the knowledge-aware attentive representation in association with BERT can extract the cannabis-depression relationship with better coverage in comparison to the state-of-the-art relation extractor.


Introduction
Depression is one of the most common mental disorders in the United States. The 2017 US National Survey on Drug Use and Health (NSDUH) indicated that approximately 13.2% (3.2 million) of adolescents aged 12 to 17 and 7.1% (17.3 million) of adults aged 18 and older reported experiencing at least one major depressive episode [1]. Although the prevalence of depression in the US population has increased, especially among young adults [2], and a variety of pharmacological treatments are available, a large proportion of individuals with depression delay seeking treatment or avoid it altogether [3]. According to NSDUH data, an estimated 35% of adults and 60.1% of adolescents who had a major depressive episode did not receive treatment [1]. Current medical cannabis policies across the US do not include depression as a qualifying condition for medical cannabis use [4]. However, emerging research indicates that coping with depression is often reported as an important reason for cannabis use [5]. The potential causal relationship and directionality between cannabis use and depression nevertheless remain uncertain [6]. There is a paucity of research on the topic, and existing studies mainly focus on data from treatment centers [7]. Furthermore, the ongoing and rapid changes in the US cannabis legislative landscape, along with the increased potency of cannabis over the past twenty years, call for timely epidemiological monitoring of lay practices and therapeutic uses of cannabis products in order to assess the impact of policy changes and to identify emerging issues and trends [8,9].
In this context, social media platforms play an important role in uncovering the experiences of individuals and their health-related knowledge [10,11]. Social media data offers the possibility to indirectly collect information about those who do not receive treatment while having depression and using cannabis. Although user-generated content constitutes a rich source of unsolicited and unfiltered self-disclosures of attitudes and practices related to cannabis use [12][13][14], it has not been explored to derive insights about the causal relationship between cannabis and depression.
Despite the recent improvements in Natural Language Processing (NLP) techniques, scientific literature utilizing NLP to investigate this type of relationship and/or focus on cannabis use remains sparse. Research has investigated the relation between cannabis use and psychosis based on Electronic Medical Records using NLP techniques [15,16]. Basic NLP techniques were also used to assess the frequency of experienced effects and harms of different generations of synthetic cannabinoids in drug-focused web forums [17]. However, and to the best of our knowledge, no research using relation extraction has investigated the link existing between cannabis and depression based on social media data.
Therefore, this research aims to design a relation extraction method facilitating the identification of the causal relationship between cannabis and depression as expressed by cannabis users using Twitter data. While we acknowledge that correlations so derived are not to be confounded with causation, they do provide insights on potential hypotheses that can be explored through RCTs in the future.
We formulate this problem as the extraction of the relationship between cannabis use and depression in terms of four possible relationships, namely Reason, Effect, Addiction, and Ambiguous (c.f. Table 1). Extracting relationships between any concepts, slang terms, synonyms, or street names related to 'cannabis', and similarly those related to 'depression', requires a domain ontology. Here, we use the Drug Abuse Ontology (DAO) [18, 19], a domain-specific hierarchical framework containing 315 entities (814 instances) and 31 relations defining drug-use and mental-health disorder concepts. The ontology has been utilized in analyzing web-forum content related to buprenorphine, cannabis, synthetic cannabinoids, and opioids [17, 20-23]. The DAO includes representations of mental health disorders and related symptoms that were developed following the DSM-5 classification. These terms were collected from the medical literature related to substance use, abuse, and addiction. In addition to medical terminology, the DAO includes commonly used lay and slang terms that were identified using prior clinical literature and social media-based studies on depressive symptomatology [24].
The DAO was expanded using DSM-5 categories covering the most common mental health disorders by utilizing the study of [25] for improving data collection about mental health and cannabis use on Twitter. The lexicon for DSM-5 has been constructed by utilizing publicly  [32,33] and Bidirectional Long Short Term Memory (Bi-LSTM) [34][35][36] networks. However, Bi-LSTM/CNN models do not generalize well and perform poorly in limited supervision scenarios [37,38]. Recently, several pre-trained language representation models have significantly advanced the state-of-the-art in various NLP tasks [39,40]. BERT [41] is one of the most powerful language representation models and has the ability to make predictions that go beyond the natural sentence boundaries [42]. Unlike CNN/LSTM models, language models benefit from the abundant knowledge acquired during self-supervised pre-training and have strong feature extraction capability. We therefore combine the representations from BERT and CNN through a novel gated fusion mechanism to get the best of both. Further, we tailored our model to capture the entity position information (using DAO knowledge), which is crucial for relation extraction (RE), as established in prior research [43,44].
We propose an end-to-end knowledge-infused deep learning framework (named Gated-K-BERT) based on the widely adopted BERT language representation model and the domain-specific DAO ontology to extract entities and their relationship. The proposed model has three modules: (1) The Entity Locator utilizes the DAO ontology to map the input word sequence to the entities mentioned in the ontology by computing the edit distance between the entity names (obtained from the DAO) and every n-gram of the input sentence. (2) The Entity Position-aware Module exploits the DAO to explicitly integrate knowledge of the entities into the model. This is done by encoding a position sequence relative to the entities. Further, we make the attention layers aware of the positions of all entities in the sentence. (3) The Encoding Module jointly leverages the distributed representations obtained from BERT and the entity position-aware module using a shared gated fusion layer to learn contextualized syntactic and semantic information that are complementary to each other.
Contributions:
(1) In collaboration with domain experts, we introduce an annotation scheme to label the relationships between cannabis and depression entities and generate a gold standard cannabis-depression relationship dataset extracted from Twitter.
(2) We propose an end-to-end knowledge-infused neural model to extract cannabis/depression entities and predict the relationship between those entities. We exploited the domain-specific DAO ontology, which provides better coverage in entity extraction. We further augment the BERT model into a knowledge-aware framework using a gated fusion layer to learn the joint feature representation.
(3) We explored entity position-aware attention in the task to jointly leverage the distributed representation of word position relative to the cannabis/depression mentions and the attention mechanism.
(4) We evaluated our proposed model on a real-world social media dataset. The experimental results show that our model outperforms the state-of-the-art relation extraction techniques. We further found that enhancing neural attention with entity position knowledge improves the ability of the model to predict the correct relationship between cannabis and depression over the vanilla attention mechanism.

Table 1. Cannabis-depression Tweets and their relationships.
Reason: "-Not saying im cured, but i feel less blue lately, could be my red supplement." "I treat my depressed with CBD oil. I dont trust depression meds. I'll stick to the all natural."
Effect: "-People will depression and be on antidepressants. It's a clash!Weed is what is making you weed." "About to smoke weed a blunt of depressed"
Addiction: "-The lack of smoke in my life is depression as hell." "i decided not to eat weed n now i feel depression"
Ambiguous: "-People with an aversion to edibles heavily are like intentionally Depressed." "i'm weed drunk and on depressed"
Here the text in the depressed and marijuana represents the cannabis and depression entities respectively.
https://doi.org/10.1371/journal.pone.0248299.t001

Related work
Based on the techniques, recent existing works can be broadly categorized into the following:  [50] advanced the previous methods based on GCN by guiding the network through the attention mechanism. Another prominent work by [51] explores adversarial learning to jointly extract entities and their relationship. To further enhance the performance of the DL models, various techniques [52,53] have also exploited latent features, in particular the entity position information, in the DL framework.
2. Pre-trained language representation model: Models such as BERT, BioBERT [54], Sci-BERT [55], and XLNet [56] have shown state-of-the-art performance on the RE task. [57] adapted BERT for the relation extraction and semantic role labeling tasks. [58] modified the BERT framework by constructing a task-specific MASK that controls the attention in the last layers of BERT. [59] also modified the original BERT architecture by introducing a structured prediction layer that is able to predict multiple relations in one pass and by making the attention layers aware of the entity positions.
3. Knowledge-base framework: The study by [60] showed the importance of external knowledge in improving relation extraction from sentences. The study incorporates the parent-child relationships in Wikipedia and word clusters over unlabeled data into a global inference procedure using Integer Linear Programming (ILP). Experiments conducted on the ACE-2004 dataset show that the use of background knowledge improved F-measure by 3.9%. A study by [61] uses an attention model to traverse a medical knowledge graph for entity pairs, which assists in precise relation extraction. [62] jointly learn the word and entity embeddings (obtained through TransE) using the anchor context model to extract the relationship and the entities. Other prominent works utilizing knowledge graphs for relation extraction include [63][64][65].

Resource creation and annotation scheme
The corpus consists of tweets collected under the eDrugTrends project [66], which aimed to analyze trends in knowledge, attitudes, and behaviors related to the use of cannabis and synthetic cannabinoids on Web forums and Twitter. Tweets were collected from January 2017 to February 2019 using the Twitter data processing, filtering, and aggregation framework available through the Twitris platform [67], which was configured to collect tweets with relevant keywords selected by the epidemiologists in the team and adapted to perform the appropriate analysis. Domain specialists (RD and FRL) in the team selected the most suitable keywords to identify both cannabis and depression based on the DAO and prior research [17, 20-24]. From the available corpus of over 100 million relevant tweets collected so far, we further filtered tweets using the DAO, based on the Cannabis and Depression entities and their respective instances specifically defined by domain experts (substance use epidemiologists) for this context. From that filtered corpus, a sample of around 11,000 tweets was sent for expert annotation to a team of 3 substance use epidemiologist co-authors who have extensive experience in drug use, abuse, and addiction research. Tweets lacking one of the key concepts related to cannabis or depression were then removed, and 5,885 tweets were finally annotated. The annotation scheme is based on the following coding:
1. Reason: Cannabis is used to help/treat/cure depression.
2. Effect: Cannabis causes depression or makes symptoms worse.
3. Addiction: Lack of access to cannabis leads to depression, showing a potential symptom of addiction.
4. Ambiguous: Implies other types of relationships, or is too ambiguous/unclear to interpret.
The category "Addiction" is an intermediate between the first two as it indicates that feelings of depression would be resolved if one had access to cannabis (which relates to category 1) and also suggests potential presence of cannabis withdrawal symptoms, thus indicating that cannabis use could lead to depressive mood (as a part of withdrawal symptoms) [68]. Due to the brevity and ambiguity of information provided in the tweet content, the team decided to classify such cases as a separate category.
Sub-samples of tweets were coded independently by each coder, and inter-coder agreement was calculated. The team went through 3 iterations of coding, assessing and discussing disagreements, and improving coding rules until an acceptable level of agreement was reached among coders (Cohen's kappa of 0.80, c.f. Table 2) [69]. Tweets that were coded differently by the two primary coders were reviewed by a third coder to resolve the disagreement. This yielded a dataset containing 5,885 tweets, of which (1) 3,243 are annotated as 'Reason', (2) 707 as 'Effect', (3) 158 as 'Addiction', and (4) 1,777 as 'Ambiguous'. The mean tweet text length is 148 tokens (median 74).
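As an illustration of the agreement statistic reported above, the following is a minimal sketch of Cohen's kappa for two coders. The function name `cohens_kappa` and the toy labels are hypothetical and are not taken from the annotated dataset.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both coders.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one identical label for every item
    return (p_o - p_e) / (1 - p_e)

# Toy labels (hypothetical, for illustration only).
coder_1 = ["Reason", "Reason", "Effect", "Ambiguous", "Addiction", "Reason"]
coder_2 = ["Reason", "Ambiguous", "Effect", "Ambiguous", "Addiction", "Reason"]
kappa = cohens_kappa(coder_1, coder_2)
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which matters here because the 'Reason' class dominates the label distribution.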
The university institutional review board (IRB) approved the study under Human Subjects Research Exemption 4 because it is limited to publicly available tweets. To protect anonymity, cited tweet content was modified slightly. We note that this dataset has some (inevitable) limitations: (i) the method only captures a sub-population of cannabis-depression related tweets in eDrugTrends campaign (i.e. those with terms defined in ontology), (ii) Tweets collected may not be a representative sample of the population as a whole, and (iii) there is no way to verify whether the tweets with self-reported cannabis related depression or cannabis related relief from depression are truthful. The team included researchers with extensive expertise in substance use epidemiology and community-based research, and they contributed to development of the annotated sample. Ethics: Our project involves analysis of Twitter data that is publicly available and that has been anonymized. It does not involve any direct interaction with any individuals or their personally identifiable data. Thus, this study was reviewed by the Wright State University IRB and received an exemption determination.

Our proposed approach
In this study, a knowledge-infused RE framework, Gated Knowledge BERT (Gated-K-BERT) is used to identify relations between entities 'cannabis' and 'depression' in a tweet. Our framework (c.f. Fig 1) consists of three components discussed as follows:

Entity locator module
Let $S$ be an input tweet containing the $n$ words $\{w_1, w_2, \ldots, w_n\}$. Extracting relationships between any concepts, slang terms, synonyms, or street names related to 'cannabis', and similarly those related to 'depression', depends heavily on a domain knowledge model. We used the domain-specific DAO to map entities in a tweet to their parent concepts in the ontology by computing the edit distance between the entity names (obtained from the DAO) and every n-gram of the input sentence. Since the DAO provides much better coverage of the entities, it is assumed that the entity name will be mentioned in the sentence.
Later, we perform masking on the extracted entities. The reason for masking is to explicitly provide the model with the entity information and also to prevent the model from overfitting its predictions to specific entities. For instance, entities related to cannabis in a tweet are masked with '<cannabis>'. Similarly, entities related to depression are masked with '<depression>'. By this, we obtain a cannabis entity $c$ and a depression entity $d$ in the tweet, corresponding to two non-overlapping consecutive spans of length $k$ and $l$: $S_c = \{w_{c_1}, w_{c_2}, \ldots, w_{c_k}\}$ and $S_d = \{w_{d_1}, w_{d_2}, \ldots, w_{d_l}\}$. In effect, this processing abstracts different lexical sequences in tweets to their meaning.
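The locate-and-mask step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `LEXICON` is a hypothetical four-entry stand-in for the DAO entity names, and the threshold `max_dist` and maximum n-gram size `max_n` are illustrative values that the text does not specify.

```python
def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# Hypothetical mini-lexicon standing in for DAO entity names.
LEXICON = {
    "smoking chronic": "<cannabis>",
    "weed": "<cannabis>",
    "depression": "<depression>",
    "depressed": "<depression>",
}

def locate_and_mask(tweet, lexicon=LEXICON, max_dist=1, max_n=2):
    """Replace n-grams within max_dist edits of a lexicon entry by its mask."""
    tokens = tweet.lower().split()
    out, i = [], 0
    while i < len(tokens):
        match = None
        # Try longer n-grams first so multi-word names win over their parts.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            gram = " ".join(tokens[i:i + n])
            for name, mask in lexicon.items():
                if edit_distance(gram, name) <= max_dist:
                    match = (n, mask)
                    break
            if match:
                break
        if match:
            out.append(match[1])
            i += match[0]
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

masked = locate_and_mask("smokin chronic helps my depresion")
```

A fixed threshold of one edit is a simplification: short everyday words that differ from a lexicon entry by a single character (for example "need" versus "weed") would also match, so a practical system would likely scale the threshold with the length of the entity name.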

Entity position-aware module
This module is designed to infuse knowledge of the entity mentions into basic neural models so that they effectively capture the contextual information w.r.t. the entities. The module consists of the following three layers:

Position embedding layer.
Inspired by the position encoding vectors used in [70,71], we define a position sequence relative to the cannabis entity, $\{p^c_1, p^c_2, \ldots, p^c_n\}$, where
$$p^c_i = \begin{cases} i - c_1, & i < c_1 \\ 0, & c_1 \le i \le c_k \\ i - c_k, & i > c_k \end{cases}$$
Here, $p^c_i$ is the relative distance of token $w_i$ to the cannabis entity, and $c_1$ and $c_k$ are the beginning and end indices of the cannabis entity, respectively. In the same way, we compute the relative distance $p^d_i$ of token $w_i$ to the depression entity. This provides two position sequences, $p^c = \{p^c_1, p^c_2, \ldots, p^c_n\}$ and $p^d = \{p^d_1, p^d_2, \ldots, p^d_n\}$. Later, for each position in the sequence, an embedding is learned with an embedding layer, producing two position embedding vectors, $P^c = \{P^c_1, P^c_2, \ldots, P^c_n\}$ for the cannabis position embeddings and $P^d = \{P^d_1, P^d_2, \ldots, P^d_n\}$ for the depression position embeddings, both sharing a position embedding matrix $P$.
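A minimal sketch of the relative position sequence, assuming the usual piecewise definition: negative distances to the left of the entity span, zero inside it, and positive distances to the right. The helper name `relative_positions`, the 0-based indexing, and the offset `max_len` are assumptions introduced for illustration.

```python
def relative_positions(n, start, end):
    """Distance of each token index 0..n-1 to the entity span [start, end]."""
    positions = []
    for i in range(n):
        if i < start:
            positions.append(i - start)  # token is left of the entity
        elif i > end:
            positions.append(i - end)    # token is right of the entity
        else:
            positions.append(0)          # token is inside the entity
    return positions

tokens = ["<cannabis>", "really", "helps", "my", "<depression>"]
p_c = relative_positions(len(tokens), 0, 0)  # relative to the cannabis entity
p_d = relative_positions(len(tokens), 4, 4)  # relative to the depression entity

# Shift by an assumed maximum tweet length so that the values can index
# rows of a shared position embedding matrix (indices must be non-negative).
max_len = 8
idx_c = [p + max_len for p in p_c]
idx_d = [p + max_len for p in p_d]
```

Because both sequences are shifted by the same offset, a single embedding table with 2 * max_len + 1 rows can serve both entities, which matches the idea of a shared position embedding matrix.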
Further, we map each of the tokens from the input tweet $S$ to the pre-trained word embedding matrix $E \in \mathbb{R}^{V \times d}$, with vocabulary size $V$ and dimension $d$. We used FastText [72] as the pre-trained word embedding. Word embeddings such as word2vec use dense vectors to represent each word in the vocabulary by projecting it into a continuous vector space, and capture both syntactic and semantic information associated with the words. However, in the case of tweets, incorrect spellings, slang, and other word forms are out-of-vocabulary (OOV) terms with respect to the word2vec model. In contrast, character-level embeddings can address the OOV issue by learning vectors for character n-grams, or parts of words. FastText, unlike word2vec, is trained at the character n-gram level, which enables the model to robustly capture words that have similar meanings but different morphological forms.
We represent the input tweet after applying the word embedding as $e = \{e_1, e_2, \ldots, e_n\}$, where $e_i \in \mathbb{R}^{d}$. Finally, each word $i$ in the tweet $S$ is represented as the concatenation of the word embedding and the position embeddings of its relative distance to the cannabis and depression entities:
$$x_i = e_i \oplus P^c_i \oplus P^d_i$$
We denote the final representation of the tweet as $x = \{x_1, x_2, \ldots, x_n\}$. The word feature and position feature representations together compose a position-aware representation.

Convolution layer.
A combined representation of the word and position embedding sequence $x$ is passed to the convolution layer, where a filter $F \in \mathbb{R}^{m \times d}$ is convolved over a context window of $m$ words for each tweet. To ensure that the output of the convolution layer has the same length as the input, we zero-pad the input sequence $x$. We call the zero-padded input $\bar{x}$. The feature at position $i$ is
$$f_i = \tanh(F \cdot \bar{x}_{i:i+m-1} + b)$$
where tanh is the non-linear activation function and $b$ is a bias term. The feature map $f$ is generated by applying a given filter $F$ to each possible window of words in a tweet. Mathematically,
$$f = [f_1, f_2, \ldots, f_n]$$
We apply context windows of different lengths $m \in M$, where $M$ is the set of context window lengths. Finally, we generate the hidden state $h_i$ at time $i$ as the concatenation of all the convolved features obtained by applying the different window sizes at time $i$.
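The same-length convolution can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: it uses a single filter per window size with made-up weights, whereas the experiments use many filters per size, and the split of the zero padding between the left and right sides is an assumption.

```python
import math

def conv_same(x, filt, bias=0.0):
    """tanh(filter . window + bias) over zero-padded x; output length == len(x)."""
    m, d = len(filt), len(x[0])
    left = (m - 1) // 2          # assumed split of the padding
    right = m - 1 - left
    zero = [0.0] * d
    padded = [zero] * left + x + [zero] * right
    out = []
    for i in range(len(x)):
        window = padded[i:i + m]
        s = sum(filt[r][c] * window[r][c] for r in range(m) for c in range(d))
        out.append(math.tanh(s + bias))
    return out

def hidden_states(x, filters):
    """Concatenate, per position, the features from filters of different widths."""
    maps = [conv_same(x, f) for f in filters]
    return [[fm[i] for fm in maps] for i in range(len(x))]

# Toy input: 4 tokens with 2-dimensional features (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
filters = [
    [[0.1, 0.1]] * 2,  # window size m = 2
    [[0.1, 0.1]] * 3,  # window size m = 3
    [[0.1, 0.1]] * 4,  # window size m = 4
]
h = hidden_states(x, filters)
```

Each row of `h` is the hidden state for one token, with one entry per window size, so the sequence length is preserved for the attention layer that follows.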

Entity position-aware attention layer.
The intuition behind adding the entity position-aware attention layer is to select relevant contexts over irrelevant ones [73]. This position-aware representation of entities in a tweet is further modulated by an ontology developed by domain experts. This enhancement enables us to selectively model attention and weigh entities in a tweet. The position-aware attention layer takes as input $h_1, h_2, h_3, \ldots, h_n$ from the encoding module. We formulate an aggregate vector $q$ mathematically as follows: The vector $q$, thus, stores the global, semantic, and syntactic information contained in a tweet.
With the aggregate vector, we compute attention weight $a_i$ for each hidden state $h_i$ as
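A minimal sketch of entity position-aware attention in plain Python. The scoring function is an assumption: it takes the mean of the hidden states as the aggregate vector q and uses an additive score v . tanh(W_h h_i + W_q q + W_c P_i^c + W_d P_i^d), a common formulation for position-aware attention that is not necessarily the exact form used in Gated-K-BERT. All weight names and toy numbers are hypothetical.

```python
import math

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def position_aware_attention(H, Pc, Pd, W_h, W_q, W_c, W_d, v):
    n, d_h = len(H), len(H[0])
    # Assumed aggregate vector: the mean of the hidden states.
    q = [sum(h[k] for h in H) / n for k in range(d_h)]
    q_part = matvec(W_q, q)
    scores = []
    for h, pc, pd in zip(H, Pc, Pd):
        parts = zip(matvec(W_h, h), q_part, matvec(W_c, pc), matvec(W_d, pd))
        hidden = [math.tanh(sum(p)) for p in parts]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    a = softmax(scores)  # one attention weight per token
    z = [sum(a[i] * H[i][k] for i in range(n)) for k in range(d_h)]
    return a, z

# Toy inputs: 3 tokens, 2-dim hidden states, 1-dim position embeddings.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Pc = [[0.0], [1.0], [2.0]]
Pd = [[-2.0], [-1.0], [0.0]]
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_q = [[0.1, 0.1], [0.1, 0.1]]
W_c = [[0.2], [0.0]]
W_d = [[0.0], [0.2]]
a, z = position_aware_attention(H, Pc, Pd, W_h, W_q, W_c, W_d, [1.0, 1.0])
a_flat, z_flat = position_aware_attention(H, Pc, Pd, W_h, W_q, W_c, W_d, [0.0, 0.0])
```

With a zero context vector every score is zero, so the weights become uniform and the output reduces to the mean hidden state; non-zero weights let the position embeddings shift attention towards tokens near the two entities.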

Encoding module
In the encoding module, we aim to obtain the semantic and task-specific contextualized representation of the tweet. We leverage the joint representation through BERT language representation model and Entity position-aware module.
Owing to its effective word- and sentence-level representations, BERT provides a task-agnostic architecture that has achieved state-of-the-art status for various NLP tasks [39,74]. We use the pre-trained BERT model with 12 Transformer layers ($L$), each having 12 self-attention heads and a hidden dimension of 768. The input to the BERT model is the tweet $S = \{w_1, w_2, \ldots, w_n\}$. It returns the hidden state representation of each Transformer layer. Formally,
$$H_1, H_2, \ldots, H_L = \mathrm{BERT}(S)$$
where $H_i \in \mathbb{R}^{n \times h_b}$ and $h_b$ is the dimension of the hidden state representation obtained from BERT. We masked the representations of the [CLS] and [SEP] tokens with zero. We obtained the tweet representation via BERT model as follows: In our experiments, the representation obtained from the second-to-last ($L - 1$) Transformer layer achieved the best performance on the task. The representation obtained from the last Transformer layer is too close to the target functions (i.e., the masked language model and next sentence prediction tasks) used during pre-training of BERT and may therefore be biased towards those targets. We also experimented with the [CLS] token representation obtained from BERT, but it did not perform well in our experimental setting.

Gated feature fusion.
The features generated by the CNN and BERT capture different aspects of the data. These features need to be combined carefully to make the most of them. A joint feature obtained through concatenation or other arithmetic operations (sum, difference, min, max, etc.) often results in a poor joint representation. To mitigate this issue, we propose a gated feature fusion technique, which learns how best to join the two feature representations using a neural gate. This gate learns what information from the CNN or BERT feature representation to keep or exclude during network training. The gating behaviour is obtained through a sigmoid activation, which ranges between 0 and 1. We learn the joint representation $F$ using the gated fusion as follows: where $W_R$, $W_B$, and $W_g$ are the parameters. Finally, the joint feature representation $F$ is fed into a single-layer feed-forward network with a softmax function to classify the tweet into one of the relation classes, $Y$ = {'reason', 'effect', 'addiction', 'ambiguous'}. More formally, where $\hat{y} \in Y$, $W$ is a weight matrix, and $a$ is the bias.
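A minimal sketch of one plausible gated fusion followed by the softmax classifier. The exact gate is an assumption: each feature vector is projected with tanh, a sigmoid gate is computed from the two projections, and the output is the gated convex combination of them. Only the parameter names W_R, W_B, W_g, W, and a come from the text; the shapes, the toy values, and the function names are hypothetical.

```python
import math

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(r, b, W_R, W_B, W_g):
    """Fuse a CNN feature r and a BERT feature b through a learned gate."""
    r_proj = [math.tanh(z) for z in matvec(W_R, r)]
    b_proj = [math.tanh(z) for z in matvec(W_B, b)]
    # Gate in (0, 1): how much of the CNN projection to keep per dimension.
    gate = [sigmoid(z) for z in matvec(W_g, r_proj + b_proj)]
    return [g * rp + (1.0 - g) * bp for g, rp, bp in zip(gate, r_proj, b_proj)]

def classify(f, W, a, labels):
    """Single feed-forward layer with softmax over the relation classes."""
    logits = [z + ai for z, ai in zip(matvec(W, f), a)]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return labels[probs.index(max(probs))], probs

LABELS = ["reason", "effect", "addiction", "ambiguous"]
r = [0.5, -0.5, 1.0]                    # toy CNN feature
b = [1.0, 0.0]                          # toy BERT feature
W_R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_B = [[1.0, 0.0], [0.0, 1.0]]
W_g = [[0.0] * 4, [0.0] * 4]            # zero gate weights give a gate of 0.5
F = gated_fusion(r, b, W_R, W_B, W_g)
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
label, probs = classify(F, W, [0.0] * 4, LABELS)
```

With zero gate weights the fusion degenerates to a plain average of the two projections; training the gate is what lets the model weight the CNN and BERT features differently per dimension, which is the stated advantage over concatenation.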

Experimental setup and results
Here, we present results on the cannabis-depression RE task. Thereafter, we provide a technical interpretation of the results followed by a domain interpretation. We chose the models' hyper-parameters using 5-fold cross-validation on the entire dataset. The convolutional layer used filters of lengths 2, 3, and 4 and a stride of 1. The optimal feature size turned out to be 128. We use Adam [75] as our optimization method with a learning rate of 0.001. The hidden size of the feed-forward layer in the relation classification layer is set to 100. The position embedding dimension is set to 100 for the position-aware attention model. The optimal value of $d_a$ is found to be 50 in all the experiments. We use the Adadelta [76] optimization algorithm to update the parameters in each epoch. As a regularizer, we use dropout [77] with a probability of 0.3.

Results
The dataset utilized in our experiment is described in Section-3. We used Recall, Precision, and F1-Score to evaluate our proposed approach against state-of-the-art relation extractors. As baseline models, we used BERT, BioBERT, BERTweet [78], and the following variations of BERT:
BERT$_{PE}$: We extend BERT with the position information (relative distance of the current word w.r.t. the cannabis/depression entities) obtained through the ontology, as a position embedding alongside the BERT embedding.
BERT$_{PE+PA}$: We introduce an additional component to the BERT$_{PE}$ model by deploying the position-aware attention mechanism.
Table 3 summarizes the performance of our model over the baselines. Our proposed model outperforms the state-of-the-art baselines on all the evaluation metrics. In comparison with BERT and BioBERT, our model achieves absolute F1-Score improvements of 2.9% and 3.69%, respectively. The results also show that infusing entity knowledge in the form of entity position-aware encoding with attention can assist in better relation classification.
Among all the BERT-based approaches, we found that BERT$_{PE}$ did not perform well. Thus, merely including position-aware encoding in the BERT framework does not help the model capture the entity information. This may be due to the built-in position embedding layer in the BERT model, which treats the explicit position encoding as noise. Further, we observed that BioBERT did not generalize as well as BERT on our task, with a minor reduction of 0.79% in absolute F1-Score. Although BioBERT is trained on a huge corpus of biomedical literature (PubMed and PMC), the noisy nature of the data hampered its performance.
Interestingly, adding the entity position information in the form of attention (BERT$_{PE+PA}$) boosted the model performance. We observe absolute improvements of 0.92%, 2.03%, and 0.65% in Precision, Recall, and F1-Score, respectively, in comparison to the BERT model. This shows that position encoding and position attention, when used together, can assist in capturing complementary features. Our final analysis reveals that solely concatenating the two representations (CNN+BERT) may not be enough to capture how much information is required from each of them. Our method, which introduces the gated fusion mechanism, can address this problem, as validated by the improved F1-Score (c.f. Table 4). We also report the class-wise performance of our proposed model in Table 5. The performance on the classes 'Effect' and 'Addiction' is comparatively lower than on the other classes. This is because 'Effect' and 'Addiction' have fewer samples (707 and 158) in the dataset, which limits the ability of the model to learn and generalize, in contrast to the other classes. Also, in real life, an explicit expression of being addicted to cannabis after depression can rarely be identified from a single tweet. It requires more contextual knowledge, such as at least 2 weeks of a user's tweet history, to capture the implicit sense of this class.
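For reference, the class-wise Precision, Recall, and F1-Score can be computed as in the following minimal sketch. The function name `per_class_prf` and the toy gold/predicted labels are hypothetical, and the sketch makes no assumption about how the reported scores are averaged across classes.

```python
def per_class_prf(gold, pred, label):
    """Precision, recall, and F1 for one class, treating it as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy labels (hypothetical, for illustration only).
gold = ["Reason", "Reason", "Effect", "Addiction", "Ambiguous", "Reason"]
pred = ["Reason", "Ambiguous", "Effect", "Reason", "Ambiguous", "Reason"]
reason_scores = per_class_prf(gold, pred, "Reason")
addiction_scores = per_class_prf(gold, pred, "Addiction")
```

In this toy example the single 'Addiction' tweet is predicted as 'Reason', so the class scores zero on all three metrics, which illustrates how a few errors dominate the scores of a small class.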

Ablation study
To analyze the impact of the various components of our model, we perform an ablation study (c.f. Table 6) by removing one component at a time from the proposed model and evaluating the performance. Results show that excluding BERT from the model significantly drops the recall of the model by 5.27% and the F1-Score by 3.16%. This shows that contextualized representation is highly necessary for the cannabis-depression classification task.
We further observed that entity position-aware attention is crucial for improving the precision of the model. We report a reduction of 1.47% in precision after excluding the position attention component.
Similarly, removing the position encoding from the input layer also leads to reduced performance, while excluding the convolution layer from the model leads to a significant drop in precision, recall, and F1-Score of 5.56%, 9.84%, and 7.89%, respectively. Thus, we show that every component in the model is beneficial for the cannabis-depression relation extraction task.
We also evaluated our proposed entity locator module (based on edit distance) against a simple string-matching technique. The results show that although the string-matching technique performs well (94.36% F-Score), there are some cases, such as 'smokin chronic' and 'marijuana candies', that are not handled correctly by basic string matching, since the DAO contains the standard entity names 'smoking chronic' and 'marijuana candy'. Our proposed DAO-based entity locator module is built upon the domain-specific ontology, which contains medical and slang terms related to substance use, abuse, and addiction. Further, as the edit distance method allows soft string matching (with insert, delete, and update operations) within a threshold, it matches ill-formed entities ('smokin chronic' to 'smoking chronic') more accurately than the string-matching method. Our DAO-based entity locator is more focused and accurate (97.12% F-Score) in spotting entities related to cannabis and depression in tweets.
We also compared the position-aware attention with the vanilla (word-level) attention discussed in [79]. We call it vanilla attention as it weighs each word regardless of the corresponding entities. Given the hidden states $h_1, h_2, h_3, \ldots, h_n$, in word-level attention, the hidden state $h_t$ of each time step $t$ is first transformed into $u_t$ using a one-layer feed-forward network. Thereafter, the importance $\alpha_t$ of each token representation is computed using a softmax layer. Formally,
$$u_t = \tanh(W_u h_t + b_u)$$
$$\alpha_t = \frac{\exp(u_t^\top v)}{\sum_{j=1}^{n} \exp(u_j^\top v)}$$
where $W_u$, $v$, and $b_u$ are the weight matrix, context vector, and bias, respectively. The final tweet representation $R$ is computed as
$$R = \sum_{t=1}^{n} \alpha_t h_t$$
The results (c.f. Table 7) show that the position-aware attention achieves better performance (an improvement of 2.45 in F1-Score) than the vanilla attention.

Domain-specific analysis
To assess the performance of our model, we examined a set of correctly and incorrectly classified tweets and came up with the following observations:
• Correctly classified tweets generally contained clear relationship words. For example, the following two tweets were correctly classified as expressing cannabis use to treat depression: "weed really helps my depression so much! i get less irritable, laugh, and so much more and people think it as the devil! f*** you mean"; "marijuana is seriously my best friend rn. it helps me sooo much with my depression and anxiety." Both tweets contained the word "help", which is often used to indicate the use of a drug for the treatment of a certain condition.
• The following correctly classified example represents a case where the relationship "treat" was expressed with the word "for": "I was forced to tell my family i have a medical for weed bc someone been ratting me out, try explaining medical marijuana for depression to a traditional thinking family, i wanna die".
• Similarly, the following tweet was correctly classified as expressing a situation where cannabis use is causing depression and/or making it worse: "me @ me when i realize weed is making me depressed but i keep smoking". It contained a clear relationship word expressing causation ("make/making").
• The incorrectly classified tweets were generally more ambiguous and/or contained implied meanings. For example, the following tweet was labeled as expressing "cannabis use to treat depression" while our model classified it as "ambiguous": "depression is hitting insufferable levels rn and hot damn i could use some weed." This is an example where the relationship is implied and no clear relationship word is expressed in the text.
• The same misclassification occurred with the following tweet: "me: wow i think im depressed i should really go to therapy: doesnt do any of that and instead uses weed to increase the dopamine in my brain." In this case, the expression "uses weed to increase the dopamine..." implies use of marijuana to improve mood (in this case, depressive mood). Because the DAO did not contain similar colloquial expressions to indicate depressive mood, our model failed to correctly classify this tweet.
Overall, our model performs better than state-of-the-art algorithms in distinguishing depression as a result of cannabis use from cannabis use as self-medication for depression. In turn, this model will help future work collect relevant information specific to the behaviors, attitudes, and knowledge of users who use cannabis to palliate their depression, as well as information on Twitter users who suffer from depression because of their previous cannabis usage.

Limitations
Several limitations should be noted. First, our work does not distinguish Cannabidiol (CBD) use from general cannabis use. This is important, as several studies suggest that CBD could reduce anxiety and potentially depression [80]. Although users tend to be specific about whether they are consuming CBD rather than another form of cannabis, future research on cannabis and depression using social media data needs to integrate this distinction. Second, although the goal of this study was to design a robust algorithm able to differentiate the causal relationships between cannabis and depression as expressed by Twitter users, our work did not aim to establish the objective causal "directionality" between cannabis and depression (i.e., is cannabis causing depression, or is cannabis a potential treatment for depression?). Third, the model has been trained on Twitter data, a rather short form of text (280 characters maximum), and may not perform as well on other text formats (e.g., blogs, web forum pages).

Conclusion
This research explored a new dimension of social media in identifying and distinguishing relationships between cannabis and depression. We introduced a state-of-the-art knowledge-aware attention framework that jointly leverages knowledge from the domain-specific DAO and DSM-5 in association with BERT for the cannabis-depression RE task. Further, our results and domain analysis help us find associations of cannabis use with depression. In order to establish a more accurate and precise Reason-Effect relationship between cannabis and depression from social media sources, our future work will follow targeted user profiles in real time and study users' exposure to cannabis over time, with the aim of informing public health policy.
Supporting information S1 Data.