Building an ML model using text data


Text-based machine learning (ML) models typically handle tasks such as text classification or text generation. Let's go through an example of building a text classification model from input text. We'll use scikit-learn, a popular general-purpose ML library that also provides tools for turning text into features.

In this example, we'll create a sentiment analysis model that classifies input text as either positive or negative sentiment. We'll use a dataset with labeled examples of positive and negative texts for training.

Data Collection and Preparation

Assuming you have a dataset containing text samples and their corresponding labels (sentiment), you can load and preprocess the data. In this example, we'll use a simplified dataset:

# Sample data
data = [
    ("I love this product!", "positive"),
    ("This movie is amazing.", "positive"),
    ("I hate Mondays.", "negative"),
    ("The service was terrible.", "negative"),
    # Add more examples here
]
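Before extracting features, it can help to check that the dataset is clean and that the labels are reasonably balanced. The following is a minimal sketch of such a check; the `summarize` helper is a hypothetical name introduced here for illustration, not part of any library.

```python
from collections import Counter

# Same (text, label) format as the sample data above
data = [
    ("I love this product!", "positive"),
    ("This movie is amazing.", "positive"),
    ("I hate Mondays.", "negative"),
    ("The service was terrible.", "negative"),
]

def summarize(dataset):
    """Drop empty texts and count how many examples each label has.

    Hypothetical helper, shown only to illustrate a basic sanity check.
    """
    cleaned = [
        (text.strip(), label)
        for text, label in dataset
        if text and text.strip()
    ]
    counts = Counter(label for _, label in cleaned)
    return cleaned, counts

cleaned, label_counts = summarize(data)
print(label_counts)
```

A heavily skewed label count is a sign that accuracy alone will be a misleading metric later on.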

Feature Extraction

Convert the text data into numerical features that ML models can work with. We'll use the bag-of-words approach: every distinct word in the training texts becomes a feature, and each text is represented by how many times each of those words occurs in it. Word order is discarded. We'll use CountVectorizer from scikit-learn for this purpose.

from sklearn.feature_extraction.text import CountVectorizer

# Extract features and labels from the data
texts, labels = zip(*data)

# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Transform text data into numerical feature vectors
X = vectorizer.fit_transform(texts)

Data Splitting

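Split the data into training and testing sets. With only four samples, the test set ends up holding a single example, so treat the scores below as an illustration of the workflow rather than a meaningful measurement.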

from sklearn.model_selection import train_test_split

# Split the data into 80% training and 20% testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
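On a larger dataset, two refinements are common: splitting the raw texts before vectorizing, so the vocabulary is learned from the training portion only, and passing `stratify` so both sets keep the same label proportions. The sketch below assumes a hypothetical eight-sample dataset made up for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical examples, invented for this sketch
texts = [
    "I love this product!",
    "This movie is amazing.",
    "What a great day.",
    "The food was delicious.",
    "I hate Mondays.",
    "The service was terrible.",
    "This is the worst film.",
    "The room was dirty.",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Split the raw texts; stratify keeps the positive/negative ratio in both sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print(len(train_texts), "training samples,", len(test_texts), "test samples")
print("Test labels:", sorted(test_labels))
```

The vectorizer would then be fitted with `fit_transform` on `train_texts` and applied with `transform` on `test_texts`.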

Model Training

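Choose an ML algorithm suitable for text classification, such as Naive Bayes or a Support Vector Machine, and train it on the training data. This example uses multinomial Naive Bayes, which is designed for count features like the ones CountVectorizer produces.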

from sklearn.naive_bayes import MultinomialNB

# Create the Naive Bayes classifier
classifier = MultinomialNB()

# Train the classifier on the training data
classifier.fit(X_train, y_train)
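One optional way to keep the vectorizer and the classifier in sync is to bundle them with scikit-learn's `make_pipeline`, so a single `fit` call learns both the vocabulary and the model from the same training texts. This is a sketch of an alternative arrangement, not a required step; the query sentence is an assumption chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "I love this product!",
    "This movie is amazing.",
    "I hate Mondays.",
    "The service was terrible.",
]
train_labels = ["positive", "positive", "negative", "negative"]

# The pipeline vectorizes the text and then feeds the counts to Naive Bayes
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Raw strings go in directly; the pipeline applies the fitted vectorizer
prediction = model.predict(["I love this movie"])[0]
print("Prediction:", prediction)
```

Because "love", "this", and "movie" all occur only in the positive training texts, the model labels this sentence as positive.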

Model Evaluation

Evaluate the model's performance on the testing data.

from sklearn.metrics import accuracy_score, classification_report

# Make predictions on the testing data
y_pred = classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report with precision, recall, and F1-score
print(classification_report(y_test, y_pred))
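A confusion matrix is a useful companion to accuracy because it shows which kind of mistake the model makes. The sketch below uses hand-written true and predicted labels (invented for illustration, not output from the model above) so the counts are easy to verify by eye.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, invented for this sketch
y_true = ["positive", "negative", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "negative"]

# Rows are true labels, columns are predicted labels, in the order given
cm = confusion_matrix(y_true, y_pred, labels=["negative", "positive"])
print(cm)
```

Here the top-right cell counts one negative text that was wrongly predicted as positive, while both positive texts were classified correctly.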

Predictions

You can use the trained model to make predictions on new input text.

# New input text
new_text = "I am enjoying this book."

# Convert the new text to a feature vector using the same vectorizer
new_text_vectorized = vectorizer.transform([new_text])

# Make a prediction using the trained classifier
new_prediction = classifier.predict(new_text_vectorized)
print("Prediction:", new_prediction[0])
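It is worth checking how confident a prediction is, since words the vectorizer never saw during training are silently ignored. The sketch below, which assumes the same four training sentences as the sample data, uses `predict_proba` to show the class probabilities for the new sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "I love this product!",
    "This movie is amazing.",
    "I hate Mondays.",
    "The service was terrible.",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

new_text = "I am enjoying this book."

# Probabilities are returned in the order of model.classes_
probs = model.predict_proba([new_text])[0]
prob_by_label = dict(zip(model.classes_, probs))
print(prob_by_label)
```

Of the words in the new sentence, only "this" appears in the training vocabulary, so the prediction rests entirely on that one word. That is a reminder of how little a model trained on four sentences actually knows.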

That's it! You've now built a simple text-based sentiment analysis model. Remember, the process can be expanded and improved by exploring different algorithms, performing hyperparameter tuning, and using larger and more diverse datasets for training. Additionally, more advanced techniques like word embeddings and deep learning models can be employed for more complex NLP tasks.
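As a starting point for hyperparameter tuning, here is a rough sketch using scikit-learn's `GridSearchCV` over a pipeline. The eight-sample dataset and the parameter values are assumptions chosen only to keep the example small; a real search needs far more data to give trustworthy results.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical examples, invented for this sketch
texts = [
    "I love this product!",
    "This movie is amazing.",
    "What a great day.",
    "The food was delicious.",
    "I hate Mondays.",
    "The service was terrible.",
    "This is the worst film.",
    "The room was dirty.",
]
labels = ["positive"] * 4 + ["negative"] * 4

pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# Step names come from make_pipeline: the lowercased class names
param_grid = {
    "countvectorizer__ngram_range": [(1, 1), (1, 2)],  # words vs. words + word pairs
    "multinomialnb__alpha": [0.1, 1.0],                # smoothing strength
}

# cv=2 only because the dataset is tiny; 5 or 10 folds are more typical
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

best_params = search.best_params_
print(best_params)
```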