TFIDF Amazon Reviews Design Document
Why I Did This Project?
This project was an opportunity for me to learn Natural Language Processing techniques and to practice preprocessing text data into trainable inputs for machine learning models.
Software Skills
- Python: programmed in Python
- NLP: understanding Natural Language Processing techniques
- Machine Learning: sklearn Python library for ML
- Data Preprocessing: Python libraries Pandas and NumPy
Project Description
This project trains multiple Machine Learning models on Amazon Review data to predict the star rating (1-5) of a given review.
The project also uses TF-IDF (term frequency-inverse document frequency) to convert review text into numerical vectors that the ML models can train on.
ML Models: Perceptron, SVM, Logistic Regression, and Multinomial Naive Bayes
Steps:
Data Preprocessing:
- 1. Load Amazon Review Data
- 2. Filter data and remove any null data points
- 3. Randomly choose 20k data points from each rating of 1-5 stars
- 4. Concatenate the five datasets of 20k each
- 5. More data preprocessing/cleaning on words in dataset
- 6. Vectorize data with TF-IDF
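The filtering, sampling, and concatenation steps above could be sketched with Pandas roughly as follows. The column names `star_rating` and `review_body`, the helper names `build_balanced_sample` and `clean_text`, and the specific cleaning rules (lowercasing, stripping non-letters) are all assumptions for illustration. A tiny in-memory frame with 2 samples per rating stands in for the real data and the 20k-per-rating sample.

```python
import re

import pandas as pd


def build_balanced_sample(df, n_per_rating, seed=0):
    """Drop rows with missing values, then sample n_per_rating reviews per star rating."""
    df = df.dropna(subset=["review_body", "star_rating"])
    samples = [
        df[df["star_rating"] == rating].sample(n=n_per_rating, random_state=seed)
        for rating in range(1, 6)
    ]
    # Concatenate the five per-rating samples into one dataset.
    return pd.concat(samples, ignore_index=True)


def clean_text(text):
    """Hypothetical cleaning: lowercase, replace non-letters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Tiny stand-in for the loaded review data (assumed column names).
raw = pd.DataFrame({
    "star_rating": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5],
    "review_body": [
        "Bad!!", "Awful...", "Meh, not good", "Poor quality", "It's OK", "Average",
        "Pretty good", "Nice one", "LOVE it", None, "Excellent!!!",
    ],
})

balanced = build_balanced_sample(raw, n_per_rating=2)
balanced["clean_review"] = balanced["review_body"].apply(clean_text)
print(balanced["star_rating"].value_counts().sort_index())
```

For the full dataset, `n_per_rating` would be 20000, and the `clean_review` column would then be passed to the TF-IDF vectorizer.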
Machine Learning/NLP:
- 1. Split the vectorized data into train and test sets
- 2. Train each ML model on train data
- 3. Have each model predict labels for test data
- 4. Output Precision, Recall, F1-score on predictions
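The training and evaluation steps above can be sketched as a single loop over the four models. This is an illustration under stated assumptions: the reviews are synthetic, `LinearSVC` is assumed as the SVM variant, the 80/20 split ratio is a guess, and all hyperparameters are sklearn defaults except `max_iter` and `random_state`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Synthetic stand-in reviews: each rating gets its own characteristic words (assumption).
rating_words = {
    1: "terrible awful broken refund",
    2: "poor disappointing flimsy",
    3: "okay average fine",
    4: "good solid nice",
    5: "excellent amazing perfect love",
}
texts, labels = [], []
for rating, words in rating_words.items():
    for i in range(20):
        texts.append(f"{words} product item{i % 3}")
        labels.append(rating)

# Vectorize first, then split, following the order of the steps listed above.
features = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

models = {
    "Perceptron": Perceptron(random_state=0),
    "SVM (LinearSVC)": LinearSVC(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial Naive Bayes": MultinomialNB(),
}

reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # Keep a dict version for later comparison and print the readable table.
    reports[name] = classification_report(
        y_test, predictions, output_dict=True, zero_division=0
    )
    print(name)
    print(classification_report(y_test, predictions, zero_division=0))
```

`classification_report` prints precision, recall, and F1-score for each star rating along with macro and weighted averages. One design note: fitting the vectorizer only on the training split is a common alternative that keeps test-set vocabulary statistics out of training.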