TherapyBot: a chatbot for mental well-being using transformers

ABSTRACT


INTRODUCTION
Chatbot systems have become increasingly popular because of their ability to converse and interact with humans efficiently. They interact using written, spoken, and visual languages. Chatbot tools are widely used in sectors such as education, marketing, and medicine to improve user satisfaction through timely responses. These tools can also be used effectively to answer the queries of individuals with mental disorders. In such situations, chatbots can be useful for individuals who hesitate to discuss mental health issues and to seek advice from other people through direct contact.
One of the fundamental goals of artificial intelligence (AI), especially natural language processing (NLP), has been to create intelligent conversation systems that react as naturally as a person to a user question, whether for a specific task or a general one. The field of NLP has witnessed stellar growth in the past few decades. One subdomain that stands out is conversational AI, which has seen some breakthroughs [1]. Conversational AI encompasses a wide range of technologies such as question-answering systems [2], [3], domain-specific and open-domain chatbots, and so on. One ingenious application that has been put forth is its use in the psychological space. Chatbot-based therapies for mental well-being are progressing rapidly. The main reason for this progress is that a machine can help, listen to, and counsel a person in an unbiased manner, without any judgment [4].

BACKGROUND OF CONVERSATIONAL ARTIFICIAL INTELLIGENCE
Significant volumes of conversational data for model training have been made available over the last decade, encouraging outcomes such as increased customer satisfaction and conversion, and improved marketing performance. Furthermore, conversational agents have evolved at an incredible rate thanks to recent breakthroughs in three areas: deep learning, a subset of AI that uses neural networks to create intelligent systems; reinforcement learning, in which an agent learns to perform a specific task by interacting with its environment and being rewarded or penalized based on its actions; and multi-task learning, in which multiple learning tasks are undertaken concurrently. Many prominent industrial conversation systems have been constructed using a combination of supervised, unsupervised, and reinforcement learning [5]. Despite their extensive usage, they face a number of issues, such as failing to understand the user's sentiment [6], losing track of the dialogue [7], providing boring non-contextual replies [8], or just stumbling with modern-day lingo [9], [10].
Recent breakthroughs in NLP have provided language models that mitigate some of these challenges. The transformer architecture has become tremendously popular because it consistently outperforms other language models such as recurrent neural networks (RNNs) [11]. This architecture, which uses fully connected neural network layers and the concept of self-attention, helps retain longer conversation histories, leading to consistent, contextual, and improved conversation. Researchers have also demonstrated that unsupervised pre-training of huge language models on a vast corpus of data leads to improved performance when they are fine-tuned on specific tasks. This can be observed clearly in OpenAI's GPT series: GPT, GPT-2, and GPT-3, the last of which can cater to a wide range of language tasks, be it question-answering, reading comprehension, text summarization, text generation, or conversation modeling [12], [13].

Mental health: a growing concern
Mental disease is a medical term that is often referred to as a mental health issue. Such concerns cover a broad range of complications that influence human thoughts, behaviors, and emotions. Addictive behaviors, anxiety, sadness, and eating-related problems are all symptoms of mental disease. Various surveys note that many people from different age groups face these issues these days [14].
However, when symptoms of mental illness cause persistent stress and significantly affect a person's ability to function, the health problem becomes a mental health disorder. A mental illness can have a negative impact on anyone's happiness and create complications in day-to-day life, such as routine work at home, studying or teaching at school or college, working at the office, or personal life. In most situations, symptoms can be treated with a mix of medicines and talk therapy, i.e., psychotherapy. Depression and loneliness are among the significant issues that our community faces today. It is also observed that most people are not open about them. Hence, it is imperative to address this issue quickly on a global level. Unfortunately, the available healthcare services are insufficient to solve this grave problem. In developed regions, there are around nine psychiatrists per 100,000 people [15]. The situation is worse in developing countries.

Conversational artificial intelligence in psychological space
Although there is still a lot to explore in conversational AI for the psychological space, its prospects are now visible as a guide for the prevention, treatment, and follow-up/relapse prevention of psychological troubles and mental disorders. Such systems could also be used in the future for suicide prevention. In the treatment of psychological issues, chatbots might offer tools that individuals can work with on their own. After the completion of classical psychotherapy, chatbots are probably the next step to stabilize intervention effects, facilitate the transfer of therapeutic content into daily life, and decrease the probability of relapse. Studies show that people find it difficult to open up to a therapist, a friend, or a colleague. Many do not even have access to a therapist. In such cases, the intervention of conversational AI is necessary [16].
Multiple mental health chatbots, such as Woebot and Replika, have shown good results [17], [18]. They help people start that initial conversation about their issues, which then becomes a regular activity. They provide a safe space for those who are not comfortable discussing their thoughts with another individual. Therefore, virtual therapy given by a chatbot could improve access to psychological treatment and is more approachable for those who are hesitant to talk with a therapist. This project aims to create an open-domain generative model for conversational AI agents leveraging a transformer-based architecture. The agent must be able to comprehend the user's input statements and generate close-to-human responses. The AI agent is expected to operate in the domain of mental health by providing psychotherapy [19], [20].
Int J Adv Appl Sci ISSN: 2252-8814. TherapyBot: a chatbot for mental well-being using transformers (Deepak Dharrao)

Dataset description
The Facebook dataset is an open-source dataset consisting of many open-domain conversations of about 5-6 sentences each. The conversations are between two individuals and make up 58,881 input-output pairs. Sample open-domain conversations from this dataset are shown in Figure 1.

Figure 1. Sample conversations from the Facebook dataset
On the other hand, the CounselChat dataset [21], [22] includes 2,130 question-answer pairs from conversations between a therapist and their client, see Figure 2. These have been scraped from counselchat.com and cover over 31 different topics ranging from 'depression' to 'substance abuse' to 'military issues', see Figure 3. The questions in the CounselChat dataset are relatively short, but most of the responses are very long in terms of the number of words, as shown in Figure 4, which makes it infeasible for us to use them. We also observe that there are more responses than questions. This implies that some questions in our dataset have multiple responses. This helps us create a more adaptive conversational model.
i) [...] "questions" and "answers" simultaneously;
ii) Replace contractions like "he's" or "they'd" with "he is" and "they would", respectively;
iii) Remove special characters;
iv) Tokenize the data by breaking each sentence into multiple words, also known as tokens, and add "start" and "end" tokens to mark the beginning and end of every sentence;
v) Encode the tokenized sentences by converting each word to a number/vector in n-dimensional space;
vi) Filter out the sentences that have more than 60 tokens; and
vii) Pad the final tokenized sentences to 60 tokens.
In this research, to make model training feasible given the hardware constraints, we have limited the maximum sentence length to 60 words. At the end of pre-processing, we obtain a dataset of 20,096 input-output pairs with a vocabulary size of 8,515.
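The pre-processing steps listed above could be sketched roughly as follows. This is a minimal illustration and not the implementation used in this work: the contraction table, the special-token names (`<pad>`, `<start>`, `<end>`, `<unk>`), the set of characters kept, and the choice to count the start and end tokens toward the 60-token limit are all assumptions.

```python
import re

# Hypothetical, deliberately tiny contraction table; a real one would be larger.
CONTRACTIONS = {"he's": "he is", "they'd": "they would", "i'm": "i am"}
MAX_LEN = 60  # maximum sentence length stated in the text
PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"  # assumed names


def clean(sentence):
    """Lowercase, expand contractions, and remove special characters."""
    sentence = sentence.lower().strip()
    for short, full in CONTRACTIONS.items():
        sentence = re.sub(r"\b" + re.escape(short) + r"\b", full, sentence)
    # Keep letters, digits, and basic punctuation; replace the rest by a space.
    sentence = re.sub(r"[^a-z0-9?.!,]+", " ", sentence)
    # Surround punctuation with spaces so each mark becomes its own token.
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    return re.sub(r"\s+", " ", sentence).strip()


def tokenize(sentence):
    """Split a cleaned sentence into words and add start/end tokens."""
    return [START] + clean(sentence).split() + [END]


def build_vocab(token_lists):
    """Assign an integer id to every distinct token."""
    vocab = {PAD: 0, START: 1, END: 2, UNK: 3}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_and_pad(tokens, vocab, max_len=MAX_LEN):
    """Convert tokens to ids and right-pad with the pad id up to max_len."""
    ids = [vocab.get(token, vocab[UNK]) for token in tokens]
    return ids + [vocab[PAD]] * (max_len - len(ids))


def preprocess(pairs, max_len=MAX_LEN):
    """Tokenize, filter out over-long pairs, then encode and pad the rest."""
    tokenized = [(tokenize(q), tokenize(a)) for q, a in pairs]
    kept = [(q, a) for q, a in tokenized
            if len(q) <= max_len and len(a) <= max_len]
    vocab = build_vocab([tokens for pair in kept for tokens in pair])
    encoded = [(encode_and_pad(q, vocab, max_len),
                encode_and_pad(a, vocab, max_len)) for q, a in kept]
    return encoded, vocab
```

For example, `tokenize("They'd go")` yields `["<start>", "they", "would", "go", "<end>"]`, and any pair whose question or answer exceeds 60 tokens is dropped before encoding.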

SYSTEM DESIGN
In this section, the dataset collection process, model training, and prediction tasks are explained. As shown in the system architecture in Figure 5, the system design can be described in three major parts: dataset collection, model training, and prediction from the trained model.

Dataset collection
In the model training phase, the dataset comprises open-domain conversations from the "Facebook dataset" and domain-specific therapist-client question-answer pairs scraped from counselchat.com to create the "CounselChat dataset". Due to a lack of high-quality mental health-oriented conversational data, we train the model on an open-domain dataset, followed by domain-specific data [23]. This ensures that the model can engage in day-to-day conversation while also providing therapy and talking about sensitive topics when required.

Model training
The open-domain Facebook dataset and the CounselChat dataset are fetched for training the proposed model. The system is trained, and the logs and model weights are saved. The predefined questions and responses on various topics are used to train the proposed system.

Prediction
The

MODEL ARCHITECTURE
The transformer model [24] was built using TensorFlow 2.0. Previously, an encoder-decoder with an RNN or long short-term memory (LSTM) base model was used for most NLP tasks, but it was not efficient at understanding long-term context. So, the concept of a transformer was introduced [25]. The transformer's architecture, shown in Figure 6, is quite like the encoder-decoder, but its layers are built on self-attention instead of an RNN/LSTM. The original transformer stacks 6 encoder layers and 6 decoder layers [26]. Each encoder has a self-attention layer and a feed-forward neural network [27]. The words must pass through all these encoder layers and then to the decoder. While the model is processing a word, the self-attention layer lets it look at other positions in the input sequence to better encode that word. The architecture is based entirely on a self-attention mechanism, which allows it to work in parallel and reduces the number of computations per layer [28], [29]. It handles variable-sized inputs using stacks of self-attention layers rather than RNNs or convolutional neural networks (CNNs) like most conversational models. It consists of two parts, an encoder and a decoder, and is quite good at capturing long-term context. Every word of the sentence is embedded into a vector of size 512, using Bag of Words and Word2Vec, before being passed to the first encoder. This embedding happens only at the bottommost encoder. The number of vectors in this input list is a hyperparameter, typically the length of the longest sentence in the dataset. After the embedding, positional encoding is applied by adding a vector to each input word embedding so that the model also knows the position of each word in the sequence. This architecture ensures that, given the clear lack of mental health data, we can leverage open-domain data and then fine-tune our model on domain-relevant data [24]. We also incorporate language-check libraries into our workflow to fix grammatical errors in the response.

Attention
This section discusses the first layer, i.e., self-attention, and how attention is calculated [24]. The first step in computing self-attention is to produce three vectors from every input passed to the encoder. So, for every word, we create a query vector, a key vector, and a value vector. These vectors are formed by multiplying the embedding by three weight matrices. The new vectors are smaller in dimension (64) than the embedding (512). The weight matrices for the query, key, and value are initialized at the start and learned during training [30]. We then multiply the embedding vector of the first word by the query weight matrix to get the first query vector. A similar process happens for every word of the sentence. We end up with a query, key, and value projection for every word in the input sentence, which lets the model attend to the relevant words.
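The projection step described above can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the random embeddings stand in for real word embeddings, the five-word sentence is hypothetical, and the function names `init_projection` and `project_qkv` are introduced here for clarity. Only the sizes 512 and 64 come from the text.

```python
import numpy as np

D_MODEL, D_K = 512, 64  # embedding and projection sizes quoted in the text


def init_projection(rng, d_in=D_MODEL, d_out=D_K):
    """Random starting weights; in a trained model these are learned."""
    return rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)


def project_qkv(embeddings, W_q, W_k, W_v):
    """Map each word embedding (one row per word) to query, key, and value."""
    return embeddings @ W_q, embeddings @ W_k, embeddings @ W_v


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, D_MODEL))  # hypothetical 5-word sentence
W_q, W_k, W_v = (init_projection(rng) for _ in range(3))
Q, K, V = project_qkv(embeddings, W_q, W_k, W_v)
print(Q.shape, K.shape, V.shape)  # (5, 64) (5, 64) (5, 64)
```

Row `Q[0]` is the query vector of the first word, obtained by multiplying its 512-dimensional embedding by the 512 x 64 query weight matrix.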
The second step is to calculate the dot product of the query vector and the key vector to get a score. The score defines how much attention to place on other parts of the input sequence as we encode the word at a particular position. The third step is to divide the scores by the square root of the dimension of the key vector, $d_k$, and pass the result through a softmax function. Softmax normalizes the scores so that they are positive and add up to one. Finally, the softmax weights are used to form a weighted sum of the value vectors. The scaled dot-product attention mechanism used in this case, shown in Figure 7(a), can be described as (1).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (1)$$
Where Q is the matrix that contains the query, i.e., the vector representation of one word in the sequence; K contains all the keys, i.e., the vector representations of all the words in the sequence; and V contains the values, which are again the vector representations of all the words in the sequence. The attention mechanism is presented in Figure 7 with the scaled dot-product and multi-head attention techniques. Figure 7(a) presents the scaled dot-product attention mechanism and Figure 7(b) presents multi-head attention. In the self-attention modules of the encoder and the decoder, V comes from the same word sequence as Q. However, in the attention module that connects the encoder and decoder sequences, V comes from a different sequence than Q. Multi-head attention is made up of four parts: i) linear layers whose outputs are split into multiple heads, ii) scaled dot-product attention, iii) concatenation of all the heads, and iv) a final linear layer. Each multi-head attention block accepts Q, K, and V as inputs.
Figure 7. Attention mechanism: (a) scaled dot-product attention and (b) multi-head attention
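The two mechanisms in Figure 7 can be sketched in NumPy as follows. This is an illustrative sketch rather than the TensorFlow code used in this work: the number of heads (8, so that each head has dimension 512 / 8 = 64), the random weights, and the five-word input are assumptions, and biases are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V; mask is True where attention is blocked."""
    d_k = K.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights


def multi_head_attention(q, k, v, params, num_heads):
    """Linear layers, split into heads, attention, concatenation, final linear."""
    W_q, W_k, W_v, W_o = params

    def split(t):  # (seq, d_model) -> (heads, seq, depth)
        seq_len, d_model = t.shape
        return t.reshape(seq_len, num_heads, d_model // num_heads).transpose(1, 0, 2)

    Q, K, V = split(q @ W_q), split(k @ W_k), split(v @ W_v)
    heads, _ = scaled_dot_product_attention(Q, K, V)
    # (heads, seq, depth) -> (seq, d_model): concatenate the heads again.
    concat = heads.transpose(1, 0, 2).reshape(q.shape[0], -1)
    return concat @ W_o


rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 512, 8, 5  # 8 heads is an assumption
x = rng.normal(size=(seq_len, d_model))  # stand-in for embedded words
params = [rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)]
out = multi_head_attention(x, x, x, params, num_heads)  # self-attention
print(out.shape)  # (5, 512)
```

Passing the same array as `q`, `k`, and `v` corresponds to self-attention; in the encoder-decoder attention module, `q` would come from the decoder and `k` and `v` from the encoder output.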

PROPOSED TRANSFORMER
As transformers have gained a lot of attention in the technical world, we implemented one with some amendments. This section describes the proposed transformer, which comprises four components, as depicted in Figure 8: masking, positional encoding, encoder, and decoder.

Masking
These models are auto-regressive in nature, i.e., they make predictions one step at a time, using the outputs generated up to that point [31]. During training, we use teacher forcing: the correct output is passed to the next time step irrespective of what was predicted at the present time step. As the transformer predicts each word, self-attention lets it consider the words that came before it in the sequence to predict the next word. A look-ahead mask is used to prevent the model from peeking at the expected output.
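A look-ahead mask and teacher forcing can be illustrated with the following sketch. It assumes NumPy, uses uniform attention scores purely so that the effect of the mask is easy to read, and the token names in the teacher-forcing example are hypothetical.

```python
import numpy as np


def look_ahead_mask(size):
    """True marks a future position that the current position must not see."""
    return np.triu(np.ones((size, size), dtype=bool), k=1)


def masked_softmax(scores, mask):
    """Softmax over the last axis with masked positions pushed towards zero."""
    scores = np.where(mask, -1e9, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)


mask = look_ahead_mask(4)
weights = masked_softmax(np.zeros((4, 4)), mask)  # uniform scores for clarity
print(weights)

# Teacher forcing: the decoder is fed the correct previous tokens and is
# trained to predict the sequence shifted by one position.
target = ["<start>", "i", "am", "here", "<end>"]  # hypothetical example
decoder_input, labels = target[:-1], target[1:]
```

With uniform scores, row i of `weights` spreads its attention evenly over positions 0 to i and gives zero weight to every later position, which is the behavior the look-ahead mask is meant to enforce.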

Positional encoding
A positional encoding vector is added to the initial embedding of each word in the input sequences. Positional encoding is added to give the model a sense of order [32], [33]. Since the model does not use any RNN layers, it is added to both the input and output embeddings and helps the model grasp the relative position of the words in a sentence. The proposed transformer is shown in Figure 9.
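The text does not state which form of positional encoding is used. As an assumption, the sketch below uses the sinusoidal encoding of the original transformer [24], with the sequence length (60) and embedding size (512) mentioned earlier; the zero-valued embeddings are only a stand-in.

```python
import numpy as np


def positional_encoding(max_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(max_len)[:, None]  # shape (max_len, 1)
    dims = np.arange(d_model)[None, :]       # shape (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates         # shape (max_len, d_model)
    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding


pe = positional_encoding(max_len=60, d_model=512)
embeddings = np.zeros((60, 512))  # stand-in for real word embeddings
encoder_input = embeddings + pe   # the encoding is added, not concatenated
print(encoder_input.shape)        # (60, 512)
```

Because each position receives a distinct pattern of sines and cosines, the model can tell positions apart even though it processes all words in parallel.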

Encoder
Every transformer has an encoding component and a decoding component. The encoder, shown in Figure 9(a), comprises input embedding, positional encoding, and "x" encoder layers. These are responsible for analyzing and representing the input sequence in a way the model can understand.

Decoder
The decoder, shown in Figure 9(b), is the other half of the transformer and comprises output embedding, positional encoding, and "x" decoder layers. The encoder and decoder layers are made up of multi-head attention and dense layers. Without the decoder, there would be no way to generate the output sequence. Without the encoder, the decoder would miss important contextual information, resulting in lower-quality output. Combining an encoder and a decoder is key to the effectiveness of the transformer architecture in NLP tasks.
The loss function is "sparse categorical cross entropy" [34] and the optimizer is "Adam". We further use a customized learning rate, as seen in Figure 10. The learning rate increases linearly from 0.0000 to 0.0010 over training steps 0 to 3,000 and then decreases gradually for training steps above 3,000.
Figure 10. Custom learning rate
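A schedule with the shape described for Figure 10 could look like the sketch below. The linear warm-up to 0.0010 over the first 3,000 steps follows the text. The inverse-square-root decay after the warm-up is an assumption borrowed from the original transformer schedule, since the text only says that the rate decreases; the function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(step, peak_lr=0.0010, warmup_steps=3000):
    """Linear warm-up to peak_lr, then an assumed inverse-square-root decay."""
    step = max(step, 0)
    if step <= warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)


for step in (0, 1500, 3000, 6000, 12000):
    print(step, learning_rate(step))
```

Under these assumptions the rate reaches its peak at step 3,000 and has fallen back to half of the peak by step 12,000.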

Performance analysis
The performance of the transformer model when trained collectively on both datasets was analyzed using two metrics, loss and perplexity, as shown in Figures 11 and 12. As shown in Figure 11, the final loss is 0.29. The loss measures the similarity between the output generated by the model and the expected output present in our data. The results for perplexity are shown in Figure 12; the lower the perplexity, the better the model is said to perform. Here the final perplexity is 1.34. Both metrics decrease steadily during training, indicating an improvement in our chatbot's performance.
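The two reported values are consistent with the usual definition of perplexity as the exponential of the mean cross-entropy loss (in nats). The text does not state this relationship, so it is treated here as an assumption; the helper names are hypothetical and the small probability table is made up for illustration.

```python
import math


def sparse_categorical_cross_entropy(probabilities, target_ids):
    """Mean negative log-probability assigned to the correct token ids."""
    losses = [-math.log(p[t]) for p, t in zip(probabilities, target_ids)]
    return sum(losses) / len(losses)


def perplexity(mean_loss):
    """Perplexity as the exponential of the mean cross-entropy loss."""
    return math.exp(mean_loss)


# Toy example: two predictions over a two-word vocabulary.
toy_loss = sparse_categorical_cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
print(round(toy_loss, 4), round(perplexity(toy_loss), 4))

# The reported loss of 0.29 corresponds to the reported perplexity.
print(round(perplexity(0.29), 2))  # 1.34
```

A perplexity close to 1 means that, on average, the model places almost all of its probability on the expected next token.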

Human evaluation
We also performed manual checks by observing the outputs of the chatbot, as shown in the screenshot in Figure 13, captured from our implementation. From these results, we can say that chatbots of this type are effective at starting smooth conversations with users. Our system starts conversations with very simple question-answer exchanges, which helps to establish a comfortable environment for further discussion, so that users can get support from the automated system as they would from a close friend. Although the approach gives the desired accuracy, it has limitations: it works only for the English language, and the accuracy may vary depending on the dataset, pre-processing, training samples, and other language-related parameters. The dataset plays a very important role here, as the core learning depends solely on it. Anger-based evaluation [35] or stress detection using social media posts [36] can also be seen as extensions to similar problems, given the availability of good datasets.

CONCLUSION
In this paper, we have proposed a chatbot system specifically for mental well-being. The results suggest that conversational agents can help people start that initial conversation about their issues, which then becomes a regular activity, and give them a safe space to discuss their thoughts. They can therefore provide virtual therapy and improve access to psychological treatment. We have used transformers to obtain a chatbot that can track context over time and does not produce bland responses like "I don't know". Furthermore, we ensure that the responses are grammatically correct. Due to a lack of high-quality conversational data related to mental health, we use two different datasets for training: a vast open-domain dataset by Facebook and mental health QA data obtained from CounselChat. The final model has a loss value of 0.29 and a perplexity of 1.34. Both metrics decreased steadily during training, indicating an improvement in our chatbot's performance.
In the future, researchers can explore many new avenues for this project, such as using a bigger model to obtain better results, as seen in many recent research experiments. Reinforcement learning can be incorporated to integrate continuous user feedback and improve model performance. Integration of sentiment analysis using multi-task learning to improve the responses is also a promising direction. Generative adversarial networks can also be explored for building chatbots. Transfer learning on state-of-the-art models can be carried out and benchmarked against the proposed approach. Other emotions, such as anger and happiness, can also be detected. This can also lead to mental stress detection using texts on social media. Overall, there are many such avenues for future research.

Figure 2. Distribution of question and response length from CounselChat

Figure 3. CounselChat number of questions by topic

The proposed system was trained with the help of open-domain datasets. This trained model is then used to predict responses to newly added questions during the testing phase of the research work. The responses are generated to work as the AI-assisted therapist. The responses generated by the implemented model were validated by experts.