# Chatbot Implementation | NLTK Python Cosine Similarity | Natural Language Processing | NLP tutorial

Friends, welcome to my channel. Dhanesh here. A chatbot is a conversational agent capable of answering user queries in the form of text, speech, or via a graphical user interface. In simple words, a chatbot is a software application that can chat with a user on any topic. Chatbots can be broadly categorized into two types: task-oriented chatbots and general-purpose chatbots. Task-oriented chatbots are designed to perform specific tasks. For example, a task-oriented chatbot can answer queries related to train reservation or pizza delivery. It can also work as a personal medical therapist or personal assistant. On the other hand, general-purpose chatbots can have open-ended discussions with the users. There is also a third type, called hybrid chatbots, that can engage in both task-oriented and open-ended discussion with the users.

There are two approaches to chatbot development: learning-based chatbots and rule-based chatbots. Learning-based chatbots use machine learning techniques and a dataset to learn to generate a response to user queries. They can be further divided into two categories: retrieval-based chatbots and generative chatbots. Rule-based chatbots are pretty straightforward compared to learning-based chatbots. There is a specific set of rules. If the user query matches any rule, the answer to the query is generated; otherwise the user is notified that the answer to the query doesn't exist. So we are doing rule-based chatbot development with Python. That's what we are going to do in this implementation. I am using the cosine similarity algorithm and the TfidfVectorizer.

Cosine similarity. Cosine similarity is the cosine of the angle between two vectors. See, in math, quantities can be classified into scalars and vectors. Scalars are physical quantities having only magnitude. Vectors are physical quantities having both magnitude and direction.
You are aware of these things, because we learn them in math. Scalars obey the regular rules of algebra, like addition and multiplication. If I am giving examples of scalars: mass and time are scalars. For vectors, one classic example I can give you is force. When a force is specified, we have to tell the direction as well. When you consider force, or an electric field, or a magnetic field, you need to consider the direction as well. Such quantities we call vectors. And one important property is that vectors don't obey the ordinary rules of algebra for addition and multiplication. How are scalars added? For example, 5 kg mass plus 5 kg mass is 10 kg: 5 + 5 = 10. But with vectors, 5 newtons plus 5 newtons can be 10, can be 0, or can be some other value, depending on the directions of the two forces. So vectors don't follow the ordinary rules of algebra.

For vector multiplication there are two types of rules we follow. One is the scalar product, and the other is the vector product. Cosine similarity comes from the scalar product, so I will discuss how. I am not discussing the vector product now. Cosine similarity is the cosine of the angle between two vectors. It is the dot product of the two vectors divided by the product of the two vectors' magnitudes. That follows from how we define the scalar product. It is written a dot b. A scalar product we denote with a dot; for a cross product, or vector product, you put a cross between a and b. So a dot b is equal to a into b into cos theta. This is the formula we use: a · b = |a| |b| cos θ. What is a here? Vector a. What is b? Vector b. What is |a|? The magnitude of vector a. What is |b|? The magnitude of vector b.
What is theta here? Theta is the angle between vector a and vector b. So a · b = |a| |b| cos θ, and from here you can write cos θ = (a · b) / (|a| |b|). That is the dot product divided by |a| |b|. So that's what is written here: cosine similarity is the cosine of the angle between two vectors. It is the dot product of the two vectors divided by the product of the two vectors' magnitudes. A note from the documentation I am reading: the cosine similarity procedure in Neo4j was developed by the Neo4j Labs team and is not officially supported. The closer the cosine value is to 1, the smaller the angle and the greater the match between the vectors. We know the values of cos: cos 0° is equal to 1 and cos 90° is equal to 0, as we learn in math. That means the closer the cosine value is to 1, the smaller the angle, and the better the match between the two vectors.

Where is it applied? Cosine similarity is generally used as a metric for measuring distance when the magnitude of the vectors does not matter. That is very important: when the magnitude of the vectors does not matter. This happens, for example, when working with text data represented by word counts. So we can use it in NLP, when we work with text data. A commonly used approach to match similar documents is based on counting the maximum number of common words between the documents. But this approach has an inherent flaw: as the size of the document increases, the number of common words tends to increase even if the documents talk about different topics.
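The point about document size can be illustrated with a small sketch. It is only an illustration under stated assumptions: the sentences, the helper names `count_vector`, `raw_overlap` and `cosine_similarity`, and the use of plain word counts are all made up for this example and are not taken from the video's script. The same document repeated three times gets a larger raw overlap with the query, while the cosine value stays the same.

```python
import math
from collections import Counter

def count_vector(text, vocabulary):
    """Word-count vector for `text` over a fixed vocabulary (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def raw_overlap(a, b):
    """Unnormalized overlap score: the dot product of the two count vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two magnitudes."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return raw_overlap(a, b) / (norm_a * norm_b)

query = "tennis ball"
short_doc = "tennis is played with a racket and a ball"
long_doc = " ".join([short_doc] * 3)  # same content, three times longer

vocabulary = sorted(set(short_doc.split()) | set(query.split()))
q = count_vector(query, vocabulary)
s = count_vector(short_doc, vocabulary)
l = count_vector(long_doc, vocabulary)

print(raw_overlap(q, s), raw_overlap(q, l))              # overlap grows with length
print(cosine_similarity(q, s), cosine_similarity(q, l))  # cosine does not
```

Because the long document's count vector is just the short one scaled by three, only its magnitude changes, and the cosine ignores magnitude.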
Cosine similarity helps overcome this fundamental flaw in the count-the-common-words or Euclidean distance approach. This is the math behind cosine similarity, which I already discussed: cos θ = (a · b) / (|a| |b|). Take any vector, say vector a, drawn as OA. If its coordinates are (x, y, z), OA can be represented as x i + y j + z k. In the same way vector b can also be represented. You are well aware of unit vectors: i, j, k are the unit vectors. i is the unit vector along the x axis, j is the unit vector along the y axis, and k is the unit vector along the z axis. So if vector a is equal to x1 i + y1 j + z1 k and vector b is equal to x2 i + y2 j + z2 k, then a · b = x1 x2 + y1 y2 + z1 z2. This is the formula for the dot product when the vectors are in this form, and that value can be substituted. What is the modulus of a? The formula is the root of x1 squared plus y1 squared plus z1 squared. That is the modulus of vector a. In the same way, the modulus of vector b is equal to the root of x2 squared plus y2 squared plus z2 squared. So that is the modulus.
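As a quick check of the component formulas, here is a minimal sketch in plain Python. The function names `dot`, `modulus` and `cos_theta` and the example vectors are assumptions chosen for illustration; they do not come from the video.

```python
import math

def dot(a, b):
    """a . b = x1*x2 + y1*y2 + z1*z2 (works for any matching length)."""
    return sum(x * y for x, y in zip(a, b))

def modulus(a):
    """|a| = sqrt(x^2 + y^2 + z^2)."""
    return math.sqrt(dot(a, a))

def cos_theta(a, b):
    """cos(theta) = (a . b) / (|a| |b|), assuming neither vector is zero."""
    return dot(a, b) / (modulus(a) * modulus(b))

a = (1, 2, 2)
b = (2, 1, 2)

print(dot(a, b))        # 1*2 + 2*1 + 2*2 = 8
print(modulus(a))       # sqrt(1 + 4 + 4) = 3.0
print(cos_theta(a, b))  # 8 / (3 * 3)

print(cos_theta((1, 0, 0), (0, 1, 0)))  # perpendicular vectors: 0.0
print(cos_theta((1, 2, 2), (2, 4, 4)))  # same direction: 1.0
```

The last two lines match the cos 90° = 0 and cos 0° = 1 cases mentioned earlier.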
That was in terms of x, y and z. If we generalize this to n dimensions, we can write it in this form: the sum from i = 1 to n of a_i b_i, divided by the root of the sum of a_i squared times the root of the sum of b_i squared. That is cos θ = Σ a_i b_i / (√(Σ a_i²) × √(Σ b_i²)). So that is the math of cosine similarity. Mathematically it measures the cosine of the angle between two vectors projected in a multi-dimensional space. In this context, the two vectors I am talking about are arrays containing the word counts of the two documents.

Now the algorithm. We call it the term frequency algorithm as well, the TF-IDF algorithm, or the word frequency algorithm. TF-IDF stands for term frequency-inverse document frequency. This is a technique to quantify a word in documents: we generally compute a weight for each word which signifies the importance of the word in the document and the corpus. It is a widely used technique in information retrieval and text mining. It is easy for us to understand a sentence, as we know the semantics of the words and the sentence. The computer can understand any data only in the form of numerical values. So for the above reasons we vectorize all of the text so that the computer can understand the text better.

Now we will see the terms. This is the math of TF-IDF, and this is the equation you need to understand: TF-IDF is equal to term frequency into inverse document frequency. I will tell you what term frequency is, and we will discuss what inverse document frequency, IDF, is. The letters we are using: t means term or word, d means document or set of words, and N represents the count of the corpus, where corpus means the total document set. These are the letters we use in the coming discussion. So the formula for TF-IDF is the term frequency into the inverse document frequency. Let's try to understand term frequency. This is very simple: it measures the frequency of a word in a document. Term
frequency measures the frequency of a word in a document. When we are vectorizing the documents, we check each word's count. In the worst case, if the term doesn't exist in the document, then that particular TF value will be 0, and in the other extreme case, if all the words in the document are the same, then it will be 1. The final value of the normalized TF will be in the range 0 to 1. In the simplest way we can define it, term frequency, TF, is how often a word appears in a document divided by how many words there are. The formula for term frequency is the number of times the term or word t appears in the document, divided by the total number of terms in the document. For example, you have a document with 1000 words and you want to find the term frequency of the word pen. How many times is pen repeated in the document? Let us assume that it is repeated 10 times. If it is repeated 10 times, how do you calculate it? It is 10 divided by 1000. What is 10 here? You are calculating the term frequency of the word pen, p-e-n, and the number of times the word pen comes in the document is 10.
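The term frequency calculation for the pen example can be sketched as follows. The function name `term_frequency` and the placeholder word "filler" are assumptions made for this illustration, not part of the video's script.

```python
from collections import Counter

def term_frequency(term, document_words):
    """Times `term` appears in the document divided by the total number of words."""
    if not document_words:
        return 0.0
    return Counter(document_words)[term] / len(document_words)

# A made-up 1000-word document in which "pen" appears 10 times.
document = ["pen"] * 10 + ["filler"] * 990

print(term_frequency("pen", document))           # 10 / 1000
print(term_frequency("pencil", document))        # word absent: 0.0
print(term_frequency("pen", ["pen", "pen"]))     # every word is the same: 1.0
```

The last two calls show the two extreme cases, so the value always stays in the range 0 to 1.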
The total number of terms in the document is 1000. Then what is the term frequency? It is 10 by 1000, which is equal to 1 by 100, that is 0.01. This is the way you calculate term frequency.

Now I will discuss document frequency. This you rarely use, but I will explain what document frequency is as well. It measures the importance of a document in the whole set of the corpus. It is very similar to term frequency. The only difference is that TF is the frequency counter for a term t in a document d, whereas DF, document frequency, is the count of occurrences of term t in the document set N. That means DF is the number of documents in which the word is present. So that is another term you will use in this algorithm.

Another important point you need to understand here is the inverse document frequency. We already discussed that term frequency is how common a word is in a document. Inverse document frequency, IDF, is how unique or rare a word is. The equation for inverse document frequency is the logarithm of the total number of documents divided by the number of documents with term t in it. That is the formula. The formula is usually written with log to the base e; in the example here I take log to the base 10 so that the numbers come out round. Let's assume that we have 1000 documents, and the number of documents with the term t in it is only 10. So you will take the logarithm of 1000 divided by 10.
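The inverse document frequency for these numbers can be sketched like this. The function name `inverse_document_frequency` and its `natural` flag are hypothetical, introduced only to show how the choice of logarithm base changes the value. Libraries may also apply smoothing, which this plain formula does not.

```python
import math

def inverse_document_frequency(total_documents, documents_with_term, natural=False):
    """log(N / df); base 10 by default, base e when natural=True."""
    ratio = total_documents / documents_with_term
    return math.log(ratio) if natural else math.log10(ratio)

idf_base10 = inverse_document_frequency(1000, 10)
idf_base_e = inverse_document_frequency(1000, 10, natural=True)

print(idf_base10)  # log10(100)
print(idf_base_e)  # ln(100), roughly 4.6

# TF-IDF is term frequency into inverse document frequency.
tf = 10 / 1000
print(tf * idf_base10)
```

A word that appears in every document gets log(1) = 0 under either base, which is why common words contribute little to the final score.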
See, what is 1000 by 10? It is 100, right? What is the logarithm of 100? To the base 10 it is 2 (to the base e it would be about 4.6). So 2 you can take as the inverse document frequency. Inverse document frequency is how unique or rare the word is.

Let's start the development. Firstly I will import the required libraries. Let me do that. Let me run this. I have imported nltk, numpy, random; all these libraries I have imported. I am going to use the Beautiful Soup 4 library to parse the data from Wikipedia, and furthermore Python's regular expression library, re, will be used for some pre-processing tasks on the text.

The next step is creating the corpus. As we said earlier, we will use the Wikipedia article on tennis to create our corpus. The script I am going to write retrieves the Wikipedia article and extracts all the paragraphs from the article text. Finally the text is converted to lower case for easier processing. Let me write the script. So this is the script for creating the corpus. Let me first run this. Yeah, it's working fine. If you see this URL, this is the input to our chatbot, input in the sense that from this it learns the text and shows the output. So this article is the input to our chatbot. It takes this data from Wikipedia and processes it. Fine.

The next step is text pre-processing, and I will be writing a helper function for that. We need to pre-process our text to remove all the special characters and empty spaces from it. If you are familiar with the NLP life cycle, this is the next step we need to follow: text pre-processing. Let me write the script. Let me run this. Yeah, it's working. We need to divide our text into sentences and words, since the cosine similarity of the user input will actually be compared against each sentence. So for that, you
know, we are going to execute the next script. Fine, it's working. Now we need to create a helper function that will remove the punctuation from the user input text and will also lemmatize the text. Lemmatization refers to reducing a word to its root form. You may be familiar with stemming and lemmatization; lemmatization is more accurate compared to stemming. For instance, lemmatization of the word ate returns eat, the word throwing will become throw, and the word worse will be reduced to bad. So let's write the script. Let me run the script. In the script we first instantiate the WordNetLemmatizer from the nltk library. Next we define a function, perform_lemmatization, which takes a list of words as input and returns the corresponding lemmatized list of words. The punctuation_removal list removes the punctuation from the passed text. Lastly, the get_processed_text method takes a sentence as input, tokenizes it, lemmatizes it, and then removes the punctuation from the sentence.

Now, how our chatbot responds to greetings. Since we are developing a rule-based chatbot, we need to handle different types of user inputs in a different manner. For instance, for greetings we will define a dedicated function. To handle greetings we will create two lists, greeting_inputs and greeting_outputs. When a user enters a greeting, we will try to search for it in the greeting_inputs list. If the greeting is found, we will randomly choose a response from the greeting_outputs list. Let me write the script. Now you are seeing the script: greeting_inputs and greeting_responses, and I have written a function, generate_greeting_response. As I discussed here, the generate_greeting_response method is basically responsible for validating the
greeting message and generating the corresponding response.

Now, how the chatbot responds to user queries; that's what I am going to discuss. Here we are using two things: the TfidfVectorizer and cosine similarity. As we said earlier, the response will be generated based upon the cosine similarity of the vectorized form of the input sentence and the sentences in the corpus. Once again, this is very important: the response is generated from the cosine similarity between the vectorized input sentence and the sentences in the corpus. The script I am going to write imports the TfidfVectorizer and the cosine_similarity function. Let me do that. Let me run this. See, from sklearn I have imported the TfidfVectorizer and cosine_similarity.

Now we have everything set up that we need to generate a response to the user queries related to tennis. We will create a method that takes in the user input, finds the cosine similarity of the user input, and compares it with the sentences in the corpus. Let me write that method. The generate_response method accepts one parameter, which is the user input. Next we define an empty string, tennisrobo_response. We then append the user input to the list of already existing sentences. After that, in the following lines, you see word_vectorizer: we initialize the TfidfVectorizer and then convert all the sentences in the corpus, together with the input sentence, into their corresponding vectorized form. In the next line you can see similar_vector_values, which is equal to cosine_similarity. What it is doing is this: we use the cosine_similarity function to find the cosine similarity between the last item in the all_word_vectors list, which is actually the word vector for the user input since it was appended at the end, and the word vectors for all the sentences
in the corpus. Then comes similar_sentence_number: we sort the list containing the cosine similarities of the vectors, and the second last item in the list will actually have the highest cosine similarity with the user input after sorting. The last item is the user input itself, therefore we did not select that. Finally we flatten the retrieved cosine similarity and check if the similarity is equal to 0 or not. You may be familiar with flattening: it converts a two-dimensional matrix into a one-dimensional vector. If the cosine similarity of the matched vector is zero, that means our query did not have an answer. In that case we will simply print that we do not understand the user query. Otherwise, if the cosine similarity is not equal to 0, that means we found a sentence similar to the input in our corpus. In that case we will just pass the index of the matched sentence to our article_sentences list, which contains the collection of all sentences. So that's what this function is doing. I have explained it clearly. Now let me run this. Fine, it's working fine.

Now, how do we chat with the chatbot? As a final step we need to create a function that allows us to chat with the chatbot that we just designed. To do so, we will write another helper function that will keep executing until the user types bye. The code I will explain after writing the script. Let me write the script. We will first set the flag continue_dialogue to true. After that we print a welcome message to the user, asking for any input. Next we initialize a while loop that keeps executing as long as the continue_dialogue flag is true. Inside the loop the user input is received, which is then converted to lower case. The user input is stored in the human_text variable. You can see in the code, if the user enters the word bye, continue_dialogue is set to
false and a goodbye message is printed to the user. On the other hand, if the input text is not equal to bye, it is checked whether the input contains words like thanks, thank you, etc. If such words are found, a reply, most welcome, is generated. Otherwise, if the user input is not equal to None, the generate_response method is called, which fetches the response based on the cosine similarity, as I explained earlier. Once the response is generated, the user input is removed from the collection of sentences, since we do not want the user input to be part of the corpus. The process continues until the user types bye. You can see why this type of chatbot is called a rule-based chatbot: there are plenty of rules to follow, and if we want to add more functionality to the chatbot we will have to add more rules. And it's easy here.

Anyway, let me run this script. I didn't run it yet. Fine. Yeah, so you can see: hello, I am your friend TennisRobo, you can ask any questions regarding tennis. I am asking, who is Federer, f-e-d-e-r-e-r. Fine, enter. Yeah: Roger Federer is considered by many observers to have the most complete game in modern tennis. The next question I am asking now: who is Nadal? Let me see what's the answer. Yeah, TennisRobo is telling: Nadal is regarded as the greatest clay court player of all time. See, when I typed bye: goodbye and take care of yourself. After that there is no further dialog box appearing. Thanks for watching. Please like, share and subscribe. Thanks a lot.
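For readers following along without the video, here is a compact sketch of the flow described above. It is not the script shown on screen and rests on several assumptions: a three-sentence hard-coded list stands in for the Wikipedia corpus, NLTK tokenization and lemmatization are skipped, the `reply` and `chat` helpers and the wording of the messages are hypothetical, and the user input is compared against the corpus rows directly instead of sorting and picking the second last item. The identifiers greeting_inputs, greeting_responses, generate_greeting_response, generate_response and article_sentences follow the names mentioned in the narration.

```python
import random
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in corpus; the video builds this list from Wikipedia.
article_sentences = [
    "roger federer is considered by many observers to have the most complete game in modern tennis.",
    "nadal is regarded as the greatest clay court player of all time.",
    "tennis is a racket sport played between two players or two teams of two players.",
]

greeting_inputs = ("hey", "hello", "hi", "greetings")
greeting_responses = ["hey", "hello, how are you?", "hi there"]


def generate_greeting_response(text):
    """Return a random greeting if any word of the input is a known greeting."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    if any(word in greeting_inputs for word in words):
        return random.choice(greeting_responses)
    return None


def generate_response(user_input):
    """Return the corpus sentence most similar to the input, by TF-IDF cosine."""
    sentences = article_sentences + [user_input.lower()]
    vectors = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # Last row is the user input; compare it with every corpus row.
    similarities = cosine_similarity(vectors[-1], vectors[:-1]).flatten()
    best = similarities.argmax()
    if similarities[best] == 0:
        return "I am sorry, I could not understand you."
    return article_sentences[best]


def reply(human_text):
    """Apply the rules in order: bye, thanks, greeting, then corpus lookup."""
    human_text = human_text.lower().strip()
    if human_text == "bye":
        return "Goodbye and take care of yourself."
    if human_text in ("thanks", "thank you", "thank you very much"):
        return "Most welcome."
    greeting = generate_greeting_response(human_text)
    if greeting is not None:
        return greeting
    return generate_response(human_text)


def chat():
    """Interactive loop that keeps running until the user types bye."""
    print("Hello, I am your friend TennisRobo. You can ask me any question regarding tennis.")
    continue_dialogue = True
    while continue_dialogue:
        human_text = input("> ")
        print("TennisRobo:", reply(human_text))
        continue_dialogue = human_text.lower().strip() != "bye"
```

Calling chat() starts the conversation in a terminal. Building the sentence list fresh inside generate_response means the user input never has to be removed from the corpus afterwards, which is a design choice of this sketch and differs from the append-then-remove approach in the narration.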