Bag of Words in Natural Language Processing



Parth Shukla



Bag of Words is a vectorization technique used to transform textual data into vector form so that it can be fed to a model for training. It is one of the most straightforward ways to vectorize text, and it is often preferred over other vectorization techniques because of its simple implementation and efficiency.

In this article, we will discuss the bag of words technique in natural language processing, which will help one understand it more deeply and answer interview questions related to it quickly and confidently.

Table of Contents

  1. What is a Bag of Words? How Does it Work?
  2. Order of Words in Bag of Words
  3. Stemming and Lemmatization in Bag of Words
  4. Implementing Bag of Words: Code Example
  5. Sparsity in Bag of Words
  6. Vocabulary Size in Bag of Words
  7. Conclusion

What is a Bag of Words? How Does it Work?

Bag of Words is one of the most accessible and widely used vectorization techniques, and it works in a very simple way. First, the unique words in the corpus are collected and denoted as the vocabulary. Then each document is represented as a vector whose entries are the counts of each vocabulary word in that document.

To understand it more clearly, let us take an example.

Let us say we have two documents to convert into their vector form.

Document 1: Machine Learning is Easy.
Document 2: Machine Learning is Hard.

Now the bag of words model will take all the unique words from the corpus and call them the vocabulary. Here the vocabulary would be Machine, Learning, is, Easy, and Hard.

Now the model simply counts the occurrences of each word in every document and converts each document into vector form. Using the word order Machine, Learning, is, Easy, Hard, the vector for Document 1 is [1, 1, 1, 1, 0] and the vector for Document 2 is [1, 1, 1, 0, 1].
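
To see this in code, here is a minimal sketch using scikit-learn's CountVectorizer (covered in more detail below). One caveat: CountVectorizer lowercases the text and sorts the vocabulary alphabetically, so the columns come out in the order easy, hard, is, learning, machine rather than the order used above.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Machine Learning is Easy", "Machine Learning is Hard"]

# Build the vocabulary and count each word's occurrences per document
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['easy' 'hard' 'is' 'learning' 'machine']
print(vectors.toarray())
# [[1 0 1 1 1]
#  [0 1 1 1 1]]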

Order of Words in Bag of Words

The bag of words algorithm does not capture the order of the words in a sentence, since it vectorizes documents purely based on word frequency. This is one of the significant limitations of the algorithm, and it is sometimes a reason to avoid bag of words altogether.

For example, the sentences Machine Learning is Hard and Hard is Machine Learning contain exactly the same words, so they map to the same vector even though they read very differently. Hence, we can sometimes lose the essence of a sentence or document, as the sketch below demonstrates.
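
Here is a quick sketch of that behavior, again using CountVectorizer: two sentences containing the same words in a different order produce identical vectors.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Machine Learning is Hard", "Hard is Machine Learning"]

vectors = CountVectorizer().fit_transform(docs).toarray()

# Both rows are identical: the word order is lost entirely
print(vectors[0])
print(vectors[1])
print((vectors[0] == vectors[1]).all())  # True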

Stemming and Lemmatization in Bag of Words

Stemming and lemmatization are techniques used in natural language processing to transform a word into its base or root form. A corpus can contain the same word in different tenses or forms; bag of words treats each form as a distinct word, which increases the dimensionality and complexity of the representation.

For example, the words run, running, and runs signify the same action, but the bag of words algorithm considers them different words and counts them individually when building the vectors. By applying stemming or lemmatization first, such words are transformed into their base or root form, so the bag of words algorithm treats them as a single word and no unnecessary complexity or dimensionality is added to the dataset.
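
As an illustration, here is a minimal stemming sketch, assuming the NLTK library is installed (pip install nltk). Its PorterStemmer collapses the different forms into a single base form, which can be applied to each document before vectorization.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["run", "running", "runs"]:
    print(word, "->", stemmer.stem(word))
# run -> run
# running -> run
# runs -> run

# Stemming a document before vectorizing it, so bag of words
# sees a single term instead of three
text = "running runs run"
stemmed = " ".join(stemmer.stem(w) for w in text.split())
print(stemmed)  # run run run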

Implementing Bag of Words: Code Example

Bag of words can be implemented with the CountVectorizer class in scikit-learn, which is one of the most popular and efficient implementations of the technique.

Let us take a dummy dataset here and implement the bag of words.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Dummy dataset
texts = ["I love this movie", "This movie is great", "I hate this movie", "This movie is terrible"]
labels = [1, 1, 0, 0]  # 1 for positive sentiment, 0 for negative sentiment

# Split the dataset into training and testing sets
# (with only 4 samples and test_size=0.2, the test set holds a single example)
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer on the training texts and transform them into feature vectors
train_features = vectorizer.fit_transform(train_texts)

# Transform the testing texts into feature vectors
test_features = vectorizer.transform(test_texts)

# Initialize and train the classifier
classifier = MultinomialNB()
classifier.fit(train_features, train_labels)

# Predict the sentiment for the testing texts
predictions = classifier.predict(test_features)

# Evaluate the model
accuracy = np.mean(predictions == test_labels)
print("Accuracy:", accuracy)


The above code uses a dummy dataset with positive and negative sentences and uses CountVectorizer to transform the sentences into vector form. The resulting vectors are then fed to a Multinomial Naive Bayes model, which is trained on the training split and evaluated on the held-out test sample.

Sparsity in Bag of Words

As the bag of words algorithm works by counting the occurrences of words in each document, and no single document contains every word in the vocabulary, the resulting vectors contain many zeros. This phenomenon is called sparsity.
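
As a quick check, CountVectorizer returns a SciPy sparse matrix, so the fraction of zero entries can be measured directly. This sketch reuses the dummy dataset from the earlier example.

from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love this movie", "This movie is great", "I hate this movie", "This movie is terrible"]

features = CountVectorizer().fit_transform(texts)  # SciPy sparse matrix

# nnz is the number of stored non-zero entries in the sparse matrix
n_rows, n_cols = features.shape
sparsity = 1.0 - features.nnz / (n_rows * n_cols)
print("Sparsity:", sparsity)  # 0.5 for this tiny corpus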

One common way to handle such sparsity is dimensionality reduction, for example with Principal Component Analysis (PCA), where the principal components of the dataset are extracted and its dimensionality is reduced.
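
Here is a minimal dimensionality-reduction sketch. One caveat: scikit-learn's PCA implementation requires a dense matrix, so TruncatedSVD, its sparse-friendly counterpart, is commonly used on bag of words output instead.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

texts = ["I love this movie", "This movie is great", "I hate this movie", "This movie is terrible"]

features = CountVectorizer().fit_transform(texts)

# TruncatedSVD accepts sparse input directly, unlike sklearn's PCA
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(features)

print(reduced.shape)  # (4, 2): each document is now a dense 2-dimensional vector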

Vocabulary Size in Bag of Words

The vocabulary consists of the unique words in the corpus, and building it is a necessary step in bag of words before the data can be transformed into vector form. The vocabulary size, that is, the total number of unique words kept, should be chosen wisely: an optimal size keeps the algorithm both accurate and fast.

The optimal vocabulary size is selected on the basis of domain knowledge and by experimenting with different vocabulary sizes and different parameters of the algorithm.
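
In scikit-learn, the vocabulary size can be capped with CountVectorizer's max_features parameter, which keeps only the most frequent terms. A short sketch, with an arbitrary illustrative cap of 3:

from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love this movie", "This movie is great", "I hate this movie", "This movie is terrible"]

# Keep only the 3 most frequent terms across the corpus; 3 is an
# arbitrary value here, and in practice the cap is tuned by experiment
vectorizer = CountVectorizer(max_features=3)
features = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # the 3 most frequent terms
print(features.shape)  # (4, 3)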

Key Takeaways

  1. Bag of words is a vectorization technique used to transform textual data into vector form.
  2. The bag of words algorithm works by counting the occurrences of different words in a document or sentence.
  3. Stemming and lemmatization are useful preprocessing steps for bag of words, transforming different forms of the same word into their root or base form.
  4. The optimal vocabulary size is selected on the basis of domain knowledge and by experimenting with different sizes and algorithm parameters.
  5. Sparsity often occurs in bag of words because not every word appears in every document. It can be handled with dimensionality reduction techniques such as PCA.

Conclusion

In this article, we discussed the bag of words algorithm along with some popular interview questions related to it and their answers. This should help one understand the technique more clearly and answer related interview questions more easily and confidently.