TF-IDF Amazon Reviews Design Document

Why I Did This Project


This project was an opportunity for me to learn Natural Language Processing techniques and to practice preprocessing text data into trainable inputs for machine learning models.

Software Skills

  • Python

    Programmed in Python

  • NLP

    Understanding Natural Language Processing techniques

  • Machine Learning

    sklearn Python library for ML

  • Data Preprocessing

    Python Libraries: Pandas and Numpy

Project Description

This project trains multiple machine learning models on Amazon review data to predict a review's star rating (1 to 5). It uses TF-IDF (term frequency-inverse document frequency) to convert the review text into numerical vectors that the ML models can train on.
ML Models: Perceptron, SVM, Logistic Regression, and Multinomial Naive Bayes
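
To make the TF-IDF weighting concrete, here is a minimal from-scratch sketch. It uses the textbook formula tf(t, d) * log(N / df(t)) with whitespace tokenization, both of which are assumptions for illustration. sklearn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each vector by default, so its numbers will differ. The function name `tfidf` and the sample reviews are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency is the term's share of the document's tokens.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Made-up reviews, only to show the weighting.
reviews = ["great product great price", "terrible product", "great value"]
weights = tfidf(reviews)
```

A word that appears in few reviews ("price") ends up with a higher weight than an equally frequent word that appears in many ("product"), which is what lets the models focus on distinctive vocabulary.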

Steps:

Data Preprocessing:

  1. Load Amazon Review Data
  2. Filter data and remove any null data points
  3. Randomly choose 20k data points from each rating of 1-5 stars
  4. Concatenate the five datasets of 20k each
  5. Apply further preprocessing/cleaning to the words in the dataset
  6. Vectorize data with TF-IDF
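
Steps 2 to 5 above can be sketched with pandas roughly as follows. This is an illustration under assumptions, not the project's actual code: the column names `review_body` and `star_rating`, the helper names `balanced_sample` and `clean_text`, and the specific cleaning rules (lowercasing, stripping HTML tags, URLs, and non-letters) are all hypothetical, and a tiny in-memory DataFrame stands in for the real review file.

```python
import re
import pandas as pd

def balanced_sample(df, per_rating, seed=0):
    """Drop rows with missing values, then sample `per_rating` rows per star rating."""
    df = df.dropna(subset=["review_body", "star_rating"])
    parts = [
        df[df["star_rating"] == stars].sample(n=per_rating, random_state=seed)
        for stars in range(1, 6)
    ]
    # Concatenate the five per-rating samples into one balanced dataset.
    return pd.concat(parts, ignore_index=True)

def clean_text(text):
    """Lowercase, strip HTML tags and URLs, keep letters only, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Tiny stand-in for the real data: 3 reviews per rating plus one null row.
raw = pd.DataFrame({
    "star_rating": [1, 2, 3, 4, 5] * 3 + [5],
    "review_body": [f"Review #{i}: <br/>Item was OK!" for i in range(15)] + [None],
})
sample = balanced_sample(raw, per_rating=2)  # the project samples 20k per rating
sample["review_body"] = sample["review_body"].map(clean_text)
```

Sampling the same number of reviews from each rating keeps the classes balanced, so a model cannot score well just by predicting the most common rating.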

Machine Learning/NLP:

  1. Split the vectorized data into train and test sets
  2. Train each ML model on train data
  3. Have each model predict labels for test data
  4. Output Precision, Recall, F1-score on predictions
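
The four steps above map onto sklearn roughly as shown below. Treat this as a sketch under assumptions: the toy corpus is made up, the 80/20 stratified split and all hyperparameters are placeholders, and LinearSVC is only one possible choice for the "SVM" model, since the source does not say which SVM variant was used. The corpus is vectorized first so that the example is self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Made-up corpus standing in for the cleaned reviews; labels are star ratings.
phrases = {1: "awful broken refund", 2: "poor flimsy disappointed",
           3: "okay average fine", 4: "good solid happy", 5: "excellent perfect love"}
texts = [f"{phrase} item {i}" for phrase in phrases.values() for i in range(20)]
labels = [stars for stars in phrases for _ in range(20)]

# Vectorize, then split (same order as the steps listed above).
X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0, stratify=labels)

models = {
    "Perceptron": Perceptron(random_state=0),
    "SVM": LinearSVC(random_state=0),  # assumption: linear SVM
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial Naive Bayes": MultinomialNB(),
}

reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)           # train on the training split
    predictions = model.predict(X_test)   # predict labels for the test split
    reports[name] = classification_report(
        y_test, predictions, output_dict=True, zero_division=0)
    print(name)
    print(classification_report(y_test, predictions, zero_division=0))
```

One design note: fitting the vectorizer on the full dataset before splitting follows the step order listed here, but fitting it on the training split only (and calling `transform` on the test split) would keep test-set vocabulary statistics out of training.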

GitHub Repo