INFO 634 Assignment 2 (12 points)

Preparation

Please import all necessary packages and set up %matplotlib for inline plots in the report. Also, load the dataset, which you can download from here: Yelp_Usefulness_Assignment2_1.csv. There are several versions of the dataset, so please make sure that you load the correct version. Please refer to the Python Data Preprocessing exercise for details about the dataset.

## Import all necessary modules and enable inline plots for the report
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from numpy import linalg
from scipy import stats
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
from sklearn.naive_bayes import MultinomialNB
from sklearn import preprocessing

## This loads the csv file from disk
yelp_data = pd.read_csv(
    filepath_or_buffer = "./data/Yelp_Usefulness_Assignment2_1.csv", sep = ",", header=0 )

print(yelp_data.head(20))
## print the dimension of the data
print(yelp_data.shape)
                   review_id  review_stars  word_count  lexical_diversity  \
0     bRGHgwAd3zfiiDMT9JyKcA             1          23           0.869565   
1     TK-0pfhHorvwZK0YhDe2fQ             5          26           0.769231   
2     XTOQ6blQzzzoK26QRJl3zg             5          71           0.760563   
3     KA9VwKYL-7I2LuQnXeuEBw             5          74           0.689189   
4     C2kblEfR4oMWR9oGhYN2cQ             5          31           0.903226   
5     mTA_VwPiWw6cubKHAsrIkQ             5          32           0.875000   
6     cnZI2W7C-D_w38WHMRer3w             1         124           0.572581   
7     UlxgrLCL9WOjJL5hZ1Zd9A             2         374           0.631016   
8     P5Sx85NU3sALCtbOx1Qgvg             2          45           0.888889   
9     Agb8ItmoRPyXPdQ8jLEgJw             4          60           0.633333   
10    g-JmmzYa4PDRKTWhuXPurg             5          21           0.761905   
11  "-jxAByrXxlQXMYbx-s37JQ"             5          97           0.762887   
12    pvfpk0afGKhCm_6eHF2HUg             4          16           0.812500   
13    64awIJhFkvTsi7HVgYYkpg             4         212           0.589623   
14    2-u_hm8jJmT6NmE2CXmEvw             4          20           0.850000   
15    p9I2_QDn8FXRDlZmsYRe-w             1          37           0.783784   
16    ERv4db1Jd0qUkC0puEIwNg             5          42           0.785714   
17    _GKA5Jp1kxxYkvJT1yTysw             1          45           0.822222   
18    Kd0fUE3pHbouy5PANkisxQ             4         170           0.658824   
19    b_XQ2S7qomhBtqTEi3EKIQ             2          81           0.802469   

    correct_spell_ratio  price_included  procon_included  joy      love  \
0              0.869565        0.000000                0  0.0  0.000000   
1              0.846154        0.000000                0  0.0  0.000000   
2              0.943662        0.014085                0  0.0  0.000000   
3              0.905405        0.000000                0  0.0  0.000000   
4              0.903226        0.000000                0  0.0  0.000000   
5              0.812500        0.031250                0  0.0  0.000000   
6              0.959677        0.008065                0  0.0  0.000000   
7              0.882353        0.000000                0  0.0  0.002667   
8              0.777778        0.000000                0  0.0  0.000000   
9              0.883333        0.000000                0  0.0  0.000000   
10             0.952381        0.000000                0  0.0  0.000000   
11             0.804124        0.010309                0  0.0  0.000000   
12             1.000000        0.000000                0  0.0  0.000000   
13             0.900943        0.000000                0  0.0  0.000000   
14             0.950000        0.000000                0  0.0  0.000000   
15             0.810811        0.000000                0  0.0  0.000000   
16             0.880952        0.023810                0  0.0  0.000000   
17             0.911111        0.000000                0  0.0  0.000000   
18             0.835294        0.011765                0  0.0  0.000000   
19             0.901235        0.037037                0  0.0  0.000000   

    affection  ...  distress  FleschReadingEase  user_review_count  \
0         0.0  ...       0.0            81.1310                  5   
1         0.0  ...       0.0            48.9568                 22   
2         0.0  ...       0.0            95.9393                 10   
3         0.0  ...       0.0            86.9222                  1   
4         0.0  ...       0.0            63.0018                 19   
5         0.0  ...       0.0            83.5138                192   
6         0.0  ...       0.0            79.5758                  3   
7         0.0  ...       0.0            70.6751                459   
8         0.0  ...       0.0            91.7543                 45   
9         0.0  ...       0.0            77.6478                  3   
10        0.0  ...       0.0            74.0150                246   
11        0.0  ...       0.0            63.5937                265   
12        0.0  ...       0.0            63.4338                 28   
13        0.0  ...       0.0            79.9034                 10   
14        0.0  ...       0.0           106.7450                  4   
15        0.0  ...       0.0            63.1022                 40   
16        0.0  ...       0.0            89.8948                 44   
17        0.0  ...       0.0            76.9711                  2   
18        0.0  ...       0.0            68.1763               1261   
19        0.0  ...       0.0            83.4026                  3   

    yelping_months  degree  betweenness  eigenvector  business_stars  \
0               24       0     0.000000     0.000000             3.0   
1                6       1     0.000000     0.000000             4.5   
2               14       0     0.000000     0.000000             4.5   
3                3       0     0.000000     0.000000             5.0   
4               40       0     0.000000     0.000000             3.5   
5               34      16     0.000035     0.006457             5.0   
6                8       1     0.000000     0.000004             4.5   
7               58      51     0.000007     0.024421             4.0   
8               74       0     0.000000     0.000000             5.0   
9                3       0     0.000000     0.000000             4.5   
10              80       1     0.000000     0.000000             4.0   
11              74      32     0.000133     0.017016             4.5   
12              15       0     0.000000     0.000000             2.5   
13               3       0     0.000000     0.000000             4.0   
14               5       4     0.000010     0.000377             3.0   
15              25       2     0.000001     0.000418             3.5   
16              81      12     0.000107     0.001598             4.0   
17              17       0     0.000000     0.000000             2.5   
18              76      12     0.000088     0.001898             3.5   
19               8       0     0.000000     0.000000             4.0   

    business_review_count       class  
0                      40      useful  
1                     319      useful  
2                     535  not_useful  
3                      28      useful  
4                      84      useful  
5                      19      useful  
6                     692      useful  
7                      47      useful  
8                     290      useful  
9                      20  not_useful  
10                    122  not_useful  
11                    168      useful  
12                     30  not_useful  
13                    446  not_useful  
14                    136  not_useful  
15                     31      useful  
16                    569      useful  
17                    123      useful  
18                   4967      useful  
19                     60  not_useful  

[20 rows x 26 columns]
(1005, 26)

1. Data Cleaning

Real-world data is not necessarily clean, so you need to clean your data; if it is not cleaned properly, it can have a significant impact on the analysis. You need to find three problems in the loaded data. Note that there may be more than three problems; as long as you find three major ones, you will meet the expectation. Examine the data by applying descriptive analysis and other measures; the data preprocessing exercise covers the basic code and procedures for this. Once you have found the problems, please describe them and explain how you will handle them. You also need to provide the rationale and pros/cons of your approach in detail.
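
As a starting point, a few generic descriptive checks such as the following sketch (not an exhaustive audit) can surface common issues like missing values, duplicates, or out-of-range values:

## A minimal sketch of descriptive checks; extend as needed
print(yelp_data.describe())               ## ranges, means, possible outliers
print(yelp_data.dtypes)                   ## unexpected data types
print(yelp_data.isnull().sum())           ## missing values per column
print(yelp_data.duplicated().sum())       ## duplicate rows
print(yelp_data["class"].value_counts())  ## label values and balance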

1.1. The first problem (1 point)

1.1.1 What is the first problem you found? How did you find it?

## Provide codes to find the problem
## Provide your open-ended answer

1.1.2 How will you handle this problem? Please code your approach.

## Provide your open-ended answer
## Provide codes to handle the problem

1.1.3 What are pros/cons of your approach?

## Provide your open-ended answer

1.2. The second problem (1 point)

1.2.1 What is the second problem you found? How did you find it?

## Provide codes to find the problem
## Provide your open-ended answer

1.2.2 How will you handle this problem? Please code your approach.

## Provide your open-ended answer
## Provide codes to handle the problem

1.2.3 What are pros/cons of your approach?

## Provide your open-ended answer

1.3. The third problem (1 point)

1.3.1 What is the third problem you found? How did you find it?

## Provide codes to find the problem 
## Provide your open-ended answer

1.3.2 How will you handle this problem? Please code your approach.

## Provide your open-ended answer
## Provide codes to handle the problem

1.3.3 What are pros/cons of your approach?

## Provide your open-ended answer

2. Data Normalization + Reduction

In data mining, you need to scale data to a common range to avoid dependence on the choice of measurement units (e.g., lb vs. kg). You also want to obtain a reduced representation of the dataset that is much smaller in volume yet produces the same (or almost the same) analytical results.

In this assignment, you will explore different options for feature subset selection (e.g., metrics and methods) and normalization (e.g., z-transformation and min-max). After exploring all options, the goal is to create the best classifier. Performance will be measured by accuracy, the ratio of the number of correct predictions to the total number of input samples, which is appropriate given the even distribution of the classes. You are not allowed to explore any options other than feature subset selection and normalization. For instance, you may use only one machine learning algorithm (Logistic Regression), and you cannot create new attributes. This way, you can see how feature subset selection and normalization contribute to the modeling. Note that we are not splitting the dataset into a training set and a testing set, although the generalizability of a classifier's performance can only be tested with unseen data (i.e., testing data). We will learn about this topic later.
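
For reference, accuracy as described above is:

$$ Accuracy = \frac{\text{number of correct predictions}}{\text{total number of input samples}} $$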

Please load the dataset (Yelp_Usefulness_Assignment2_2.csv). There are several versions of the dataset, so please make sure that you load the correct version.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## This loads the csv file from disk
yelp_data = pd.read_csv(
    filepath_or_buffer = "./data/Yelp_Usefulness_Assignment2_2.csv", sep = ",", header=0 )

print(yelp_data.head(20))
                   review_id  review_stars  word_count  lexical_diversity  \
0     bRGHgwAd3zfiiDMT9JyKcA             1          23           0.869565   
1     TK-0pfhHorvwZK0YhDe2fQ             5          26           0.769231   
2     XTOQ6blQzzzoK26QRJl3zg             5          71           0.760563   
3     KA9VwKYL-7I2LuQnXeuEBw             5          74           0.689189   
4     C2kblEfR4oMWR9oGhYN2cQ             5          31           0.903226   
5     mTA_VwPiWw6cubKHAsrIkQ             5          32           0.875000   
6     cnZI2W7C-D_w38WHMRer3w             1         124           0.572581   
7     UlxgrLCL9WOjJL5hZ1Zd9A             2         374           0.631016   
8     P5Sx85NU3sALCtbOx1Qgvg             2          45           0.888889   
9     Agb8ItmoRPyXPdQ8jLEgJw             4          60           0.633333   
10    g-JmmzYa4PDRKTWhuXPurg             5          21           0.761905   
11  "-jxAByrXxlQXMYbx-s37JQ"             5          97           0.762887   
12    pvfpk0afGKhCm_6eHF2HUg             4          16           0.812500   
13    64awIJhFkvTsi7HVgYYkpg             4         212           0.589623   
14    2-u_hm8jJmT6NmE2CXmEvw             4          20           0.850000   
15    p9I2_QDn8FXRDlZmsYRe-w             1          37           0.783784   
16    ERv4db1Jd0qUkC0puEIwNg             5          42           0.785714   
17    _GKA5Jp1kxxYkvJT1yTysw             1          45           0.822222   
18    Kd0fUE3pHbouy5PANkisxQ             4         170           0.658824   
19    b_XQ2S7qomhBtqTEi3EKIQ             2          81           0.802469   

    correct_spell_ratio  price_included  procon_included  joy      love  \
0              0.869565        0.000000                0  0.0  0.000000   
1              0.846154        0.000000                0  0.0  0.000000   
2              0.943662        0.014085                0  0.0  0.000000   
3              0.905405        0.000000                0  0.0  0.000000   
4              0.903226        0.000000                0  0.0  0.000000   
5              0.812500        0.031250                0  0.0  0.000000   
6              0.959677        0.008065                0  0.0  0.000000   
7              0.882353        0.000000                0  0.0  0.002667   
8              0.777778        0.000000                0  0.0  0.000000   
9              0.883333        0.000000                0  0.0  0.000000   
10             0.952381        0.000000                0  0.0  0.000000   
11             0.804124        0.010309                0  0.0  0.000000   
12             1.000000        0.000000                0  0.0  0.000000   
13             0.900943        0.000000                0  0.0  0.000000   
14             0.950000        0.000000                0  0.0  0.000000   
15             0.810811        0.000000                0  0.0  0.000000   
16             0.880952        0.023810                0  0.0  0.000000   
17             0.911111        0.000000                0  0.0  0.000000   
18             0.835294        0.011765                0  0.0  0.000000   
19             0.901235        0.037037                0  0.0  0.000000   

    affection  ...  distress  FleschReadingEase  user_review_count  \
0         0.0  ...       0.0            81.1310                  5   
1         0.0  ...       0.0            48.9568                 22   
2         0.0  ...       0.0            95.9393                 10   
3         0.0  ...       0.0            86.9222                  1   
4         0.0  ...       0.0            63.0018                 19   
5         0.0  ...       0.0            83.5138                192   
6         0.0  ...       0.0            79.5758                  3   
7         0.0  ...       0.0            70.6751                459   
8         0.0  ...       0.0            91.7543                 45   
9         0.0  ...       0.0            77.6478                  3   
10        0.0  ...       0.0            74.0150                246   
11        0.0  ...       0.0            63.5937                265   
12        0.0  ...       0.0            63.4338                 28   
13        0.0  ...       0.0            79.9034                 10   
14        0.0  ...       0.0           106.7450                  4   
15        0.0  ...       0.0            63.1022                 40   
16        0.0  ...       0.0            89.8948                 44   
17        0.0  ...       0.0            76.9711                  2   
18        0.0  ...       0.0            68.1763               1261   
19        0.0  ...       0.0            83.4026                  3   

    yelping_months  degree  betweenness  eigenvector  business_stars  \
0               24       0     0.000000     0.000000             3.0   
1                6       1     0.000000     0.000000             4.5   
2               14       0     0.000000     0.000000             4.5   
3                3       0     0.000000     0.000000             5.0   
4               40       0     0.000000     0.000000             3.5   
5               34      16     0.000035     0.006457             5.0   
6                8       1     0.000000     0.000004             4.5   
7               58      51     0.000007     0.024421             4.0   
8               74       0     0.000000     0.000000             5.0   
9                3       0     0.000000     0.000000             4.5   
10              80       1     0.000000     0.000000             4.0   
11              74      32     0.000133     0.017016             4.5   
12              15       0     0.000000     0.000000             2.5   
13               3       0     0.000000     0.000000             4.0   
14               5       4     0.000010     0.000377             3.0   
15              25       2     0.000001     0.000418             3.5   
16              81      12     0.000107     0.001598             4.0   
17              17       0     0.000000     0.000000             2.5   
18              76      12     0.000088     0.001898             3.5   
19               8       0     0.000000     0.000000             4.0   

    business_review_count       class  
0                      40      useful  
1                     319      useful  
2                     535  not_useful  
3                      28      useful  
4                      84      useful  
5                      19      useful  
6                     692      useful  
7                      47      useful  
8                     290      useful  
9                      20  not_useful  
10                    122  not_useful  
11                    168      useful  
12                     30  not_useful  
13                    446  not_useful  
14                    136  not_useful  
15                     31      useful  
16                    569      useful  
17                    123      useful  
18                   4967      useful  
19                     60  not_useful  

[20 rows x 26 columns]

Then, we are going to build a baseline model using the Logistic Regression classifier. At the end of the following code, you can see the accuracy of this baseline classifier (0.752). After exploring various options for normalization and feature subset selection, you will want to beat this baseline model.

## Create the feature matrix by dropping the review_id and label attributes;
## review_id is not going to be helpful for predicting the usefulness of reviews
X = yelp_data.drop(["review_id", "class"], axis=1)

## Pre-processing: sklearn expects an integer label, so map the class values
## to 1/0 (use .loc to avoid overwriting entire rows)
yelp_data.loc[yelp_data['class'] == 'useful', 'class'] = 1
yelp_data.loc[yelp_data['class'] == 'not_useful', 'class'] = 0

## Specify the data type; before this, the column dtype was object
y = yelp_data["class"].astype('int')

## Create a model
clf = LogisticRegression()
clf.fit(X, y)

## predict target class based on the trained model 
predictions = clf.predict(X)

## Calculate the performance of the classifier
accuracy = accuracy_score(predictions, y)

print(accuracy)
0.752

Here is an example of scaling data using the z-transformation (StandardScaler).

## Apply z-transformation
z_scaler = preprocessing.StandardScaler()
X_scaled = z_scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns = X.columns)

Here is an example of scaling data using the min-max transformation.

## Apply Min_Max Scaler
min_max_scaler = preprocessing.MinMaxScaler()
X_minmax = min_max_scaler.fit_transform(X)
X_minmax = pd.DataFrame(X_minmax, columns = X.columns)

Here is an example of using forward feature selection with the min-max transformation.

## import the necessary libraries
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

min_max_scaler = preprocessing.MinMaxScaler()
X_minmax = min_max_scaler.fit_transform(X)
X_minmax = pd.DataFrame(X_minmax, columns = X.columns)

## Sequential Forward Selection (SFS)
sfs = SFS(LogisticRegression(),
           k_features=(1,X_minmax.shape[1]),
           forward=True, 
           floating=False,
           scoring = 'accuracy',
           cv = 10)

sfs = sfs.fit(X_minmax, y)
## Get the final set of features
print(sfs.k_feature_names_)

X_selected = sfs.transform(X_minmax)

## Fit the estimator using the new feature subset
## and make predictions on the same data (there is no train/test split here)
clf.fit(X_selected, y)
predictions = clf.predict(X_selected)

## Calculate the performance of the classifier
accuracy = accuracy_score(predictions, y)

print(accuracy)

Actually, the performance will be lower than the baseline classifier if you use forward feature selection with the min-max transformation. Now, explore all the available options. For instance, you can set the "forward" argument of the SequentialFeatureSelector to False to use backward feature selection. If you set the "floating" argument to True, you will use bi-directional (stepwise floating) feature selection. You can also change the "scoring" and "cv" options; one possible alternative configuration is sketched below.
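
For instance, a backward floating configuration on z-scaled features might look like the following sketch (the particular settings are illustrative, not a recommendation):

## A sketch of one alternative: backward floating selection on z-scaled
## features; tune k_features, scoring, and cv yourself
z_scaler = preprocessing.StandardScaler()
X_zscaled = pd.DataFrame(z_scaler.fit_transform(X), columns = X.columns)

sfs_back = SFS(LogisticRegression(),
               k_features=(1, X_zscaled.shape[1]),
               forward=False,
               floating=True,
               scoring='accuracy',
               cv=10)
sfs_back = sfs_back.fit(X_zscaled, y)
print(sfs_back.k_feature_names_)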

Please refer to the following link for the detailed arguments options: http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/#sequential-feature-selector

Enjoy all options, find the best classifier, and answer the following questions.

2.1. Performance (1 point)

What was the best performance you got? Please provide the code for your best classifier. The student who achieves the best accuracy will be given ONE bonus point towards their final grade and, of course, fame and glory.

## Provide the best accuracy
## Provide codes to find the best classifier

2.2. Insight and Explanation (2 points)

Provide a couple of paragraphs describing what you tried, what worked, and what did not work. Describe the lessons you learned from this exercise. Please be comprehensive in conveying what you have done. Graphs or tables may help you find meaningful patterns in your experiments.

## Provide your open-ended answer

3. Association Rule Mining: Theory

A typical example of association rule mining is market basket analysis. This process analyzes transaction data to find associations between the different items that customers place in their “shopping baskets”. The discovery of these associations can help retailers develop marketing strategies by understanding which items are frequently purchased together by customers. Also, online retailers can develop a recommendation system based on association rule mining.

You are going to find association rules from the following virtual transactional data.

| Transaction ID | Items bought |
| --- | --- |
| 1 | Computer, Mouse |
| 2 | Computer, Tablet, Smart Phone, Smart Watch |
| 3 | Computer, Smart Watch, Game Console |
| 4 | Mouse, Game Console |
| 5 | Tablet, Smart Watch, Smart Phone |
| 6 | Smart Phone, Smart Watch |

3.1. Rule Calculation

3.1.1 Calculation of support (1 point)

First of all, you need to calculate the first measure, support, which is the frequency of an itemset. You are going to use relative support, the fraction of transactions that contain itemset X (i.e., the probability that a transaction contains itemset X). If you are not familiar with this, please watch the lecture video or read the textbook (Section 6.1). Please note that both the minimum support and the minimum confidence are 50% (i.e., 0.5).

First, calculate the support for frequent 1-itemsets, which consist of a single item. If an item does not pass the minimum support threshold, it does not need to be included in the 2-itemset candidates, which consist of 2 items (e.g., Computer and Mouse), according to the Apriori property. Examine the data values in the table above and include the calculation process. If it takes too much effort to write out the calculation in the Jupyter Notebook, you can write it on paper and turn it in as an attachment. Please create tables listing the corresponding values (a small verification sketch follows the tables):

Frequent 1-itemsets

| Itemset | Count | Total # of Transactions | Support | Passing Minimum Support? |
| --- | --- | --- | --- | --- |
| Computer | 3 | 6 | 0.5 | Yes |
| Mouse | # | # | .xx | Yes / No |
| Smart Watch | # | # | .xx | Yes / No |
| Tablet | # | # | .xx | Yes / No |
| Smart Phone | # | # | .xx | Yes / No |
| Game Console | # | # | .xx | Yes / No |

Frequent 2-itemsets

| Itemset | Count | Total # of Transactions | Support | Passing Minimum Support? |
| --- | --- | --- | --- | --- |
| Computer, Mouse | # | # | .xx | Yes / No |
| Computer, Smart Watch | # | # | .xx | Yes / No |
| ... | | | | |

Frequent 3-itemsets

| Itemset | Count | Total # of Transactions | Support | Passing Minimum Support? |
| --- | --- | --- | --- | --- |
| Computer, Tablet, Smart Phone | # | # | .xx | Yes / No |
| ... | | | | |
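
To double-check your hand calculations, a small sketch such as the following (the transactions list simply transcribes the table above) computes the relative support of any itemset:

## Transcribe the transaction table, then count containing transactions
transactions = [
    {"Computer", "Mouse"},
    {"Computer", "Tablet", "Smart Phone", "Smart Watch"},
    {"Computer", "Smart Watch", "Game Console"},
    {"Mouse", "Game Console"},
    {"Tablet", "Smart Watch", "Smart Phone"},
    {"Smart Phone", "Smart Watch"},
]

def support(itemset):
    ## Relative support: fraction of transactions that contain the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"Computer"}))  ## 0.5, matching the example row in the table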

3.1.2 Association Rules (1 point)

Itemsets that pass the minimum support can be used to find association rules X ⇒ Y. Find all the rules X ⇒ Y that satisfy the minimum support and minimum confidence. Please see the following equations:

$$ support(A \Rightarrow B) = P(A \cup B) $$

$$ confidence(A \Rightarrow B) = P(B|A) = \frac{support(A \cup B)}{support(A)} $$

Please calculate the support and confidence for each rule you found and include the calculation process. If it takes too much effort to write out the calculation in the Jupyter Notebook, you can write it on paper and turn it in as an attachment. Please create a table listing the corresponding values (a small sketch for checking confidence follows the table):

| Rule | support (A⇒B) | confidence (A⇒B) | Passing Minimum Support? | Passing Minimum Confidence? |
| --- | --- | --- | --- | --- |
| Computer ⇒ Smart Watch | .xx | .xx | Yes / No | Yes / No |
| ... | | | | |
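
Extending the support() helper from the sketch above, confidence can be checked the same way (again, just a verification aid):

def confidence(a, b):
    ## confidence(A => B) = support(A u B) / support(A)
    return support(a | b) / support(a)

print(confidence({"Computer"}, {"Smart Watch"}))  ## compare with your hand calculation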

3.2. Pattern Evaluation Method

The support-confidence framework is often limited because not all rules it finds strong are interesting. Therefore, we need to utilize other interestingness measures, such as lift and chi-squared ($χ^2$). You are going to calculate these other interestingness measures from the following contingency table and compare their pros/cons.

|  | Smart Watch | Not Smart Watch |
| --- | --- | --- |
| Smart Phone | 500 | 350 |
| Not Smart Phone | 100 | 50 |


$$ Lift(A, B) = \frac{Confidence(A \Rightarrow B)}{Support(B)} = \frac{Support(A \cup B)}{Support(A) \times Support(B)} $$
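
The chi-squared statistic compares observed and expected counts:

$$ χ^2 = \sum \frac{(observed - expected)^2}{expected} $$

To check your hand calculation, a minimal sketch using scipy (the stats module is already imported in the Preparation section) computes $χ^2$ directly from the contingency table above; correction=False disables Yates' continuity correction so the result matches the plain formula:

## Verify the chi-squared value for the contingency table above
observed = np.array([[500, 350],    ## Smart Phone row
                     [100,  50]])   ## Not Smart Phone row
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(chi2)  ## compare with your hand calculation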

3.2.1 Calculation of interestingness measures (0.5 points)

Please calculate the lift and chi-squared ($χ^2$) for each rule and include the calculation process. If it takes too much effort to write out the calculation in the Jupyter Notebook, you can write it on paper and turn it in as an attachment. Please create a table listing the corresponding values:

| Rule | support (A⇒B) | confidence (A⇒B) | Lift (A⇒B) | chi-squared ($χ^2$) |
| --- | --- | --- | --- | --- |
| Smart Watch ⇒ Smart Phone | .xx | .xx | .xx | xx.xx |
| Not Smart Watch ⇒ Smart Phone | .xx | .xx | .xx | xx.xx |

3.2.2 Insight and Explanation (0.5 points)

If you calculated correctly, the measures may tell you different stories. Let's assume that both the minimum support and the minimum confidence are 35% (i.e., 0.35). What do these interestingness measures tell you? Please explain what each measure means. In other words, can you tell whether there is a strong association or not? What are the pros/cons of each measure, particularly in relation to this example? How are you going to use these measures in the future, according to what you learned?

## Provide your open-ended answer

4. Association Rule Mining: Practice

You will practice with one of the advanced pattern mining topics: quantitative association rules. So far, you have practiced with categorical variables and calculated rules based on a contingency table. However, relational and data warehouse data often involve quantitative attributes and measures. For instance, in the Yelp dataset, most attributes are not categorical but numeric. To apply association rule mining, you can convert those attributes to binary attributes using either a median or a mean split. This is not an ideal approach, because it loses information, but it is good enough for illustration. In this assignment, you are going to use the mean split. Based on this discretization, you will find association rules and analyze them. Perhaps these association rules can help you find discriminative attributes for predicting the usefulness of reviews; in other words, you can use association rule mining for feature selection. You will need to use evaluation metrics and your own (subjective) judgment to find meaningful associations.

4.1. Discretization (1 point)

Now you need to create a binary representation of the attributes. We need to include the "class" attribute but drop the "review_id" attribute.

Here is an example of converting the textual representation to a numeric representation. The library you will use (MLXtend) expects a 1/0 representation for binary attributes.

yelp_data.loc[yelp_data['class'] == 'useful', 'class'] = 1
yelp_data.loc[yelp_data['class'] == 'not_useful', 'class'] = 0
yelp_data["class"] = yelp_data["class"].astype('int')

Here is an example of discretization (a mean split on a single attribute).

yelp_data["degree"] = np.where(yelp_data["degree"] >= np.mean(yelp_data["degree"]),1,0)
yelp_data["degree"].astype('int')

You need to apply the above transformation iteratively to all attributes, for instance with a loop like the sketch below.
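
A minimal sketch of such a loop, applied after the class conversion shown above (it skips the identifier and the label):

## Apply the mean split to every attribute except the identifier and the label
for col in yelp_data.columns.drop(["review_id", "class"]):
    yelp_data[col] = np.where(yelp_data[col] >= np.mean(yelp_data[col]), 1, 0)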

Please load the dataset (Yelp_Usefulness_Assignment2_2.csv). As noted, there are several versions of the dataset, so please make sure that you load the correct version. Convert the data using your own code and show the first 20 rows.

## This loads the csv file from disk
yelp_data = pd.read_csv(
    filepath_or_buffer = "./data/Yelp_Usefulness_Assignment2_2.csv", sep = ",", header=0 )

print(yelp_data.head(20))
print(yelp_data.shape)
                   review_id  review_stars  word_count  lexical_diversity  \
0     bRGHgwAd3zfiiDMT9JyKcA             1          23           0.869565   
1     TK-0pfhHorvwZK0YhDe2fQ             5          26           0.769231   
2     XTOQ6blQzzzoK26QRJl3zg             5          71           0.760563   
3     KA9VwKYL-7I2LuQnXeuEBw             5          74           0.689189   
4     C2kblEfR4oMWR9oGhYN2cQ             5          31           0.903226   
5     mTA_VwPiWw6cubKHAsrIkQ             5          32           0.875000   
6     cnZI2W7C-D_w38WHMRer3w             1         124           0.572581   
7     UlxgrLCL9WOjJL5hZ1Zd9A             2         374           0.631016   
8     P5Sx85NU3sALCtbOx1Qgvg             2          45           0.888889   
9     Agb8ItmoRPyXPdQ8jLEgJw             4          60           0.633333   
10    g-JmmzYa4PDRKTWhuXPurg             5          21           0.761905   
11  "-jxAByrXxlQXMYbx-s37JQ"             5          97           0.762887   
12    pvfpk0afGKhCm_6eHF2HUg             4          16           0.812500   
13    64awIJhFkvTsi7HVgYYkpg             4         212           0.589623   
14    2-u_hm8jJmT6NmE2CXmEvw             4          20           0.850000   
15    p9I2_QDn8FXRDlZmsYRe-w             1          37           0.783784   
16    ERv4db1Jd0qUkC0puEIwNg             5          42           0.785714   
17    _GKA5Jp1kxxYkvJT1yTysw             1          45           0.822222   
18    Kd0fUE3pHbouy5PANkisxQ             4         170           0.658824   
19    b_XQ2S7qomhBtqTEi3EKIQ             2          81           0.802469   

    correct_spell_ratio  price_included  procon_included  joy      love  \
0              0.869565        0.000000                0  0.0  0.000000   
1              0.846154        0.000000                0  0.0  0.000000   
2              0.943662        0.014085                0  0.0  0.000000   
3              0.905405        0.000000                0  0.0  0.000000   
4              0.903226        0.000000                0  0.0  0.000000   
5              0.812500        0.031250                0  0.0  0.000000   
6              0.959677        0.008065                0  0.0  0.000000   
7              0.882353        0.000000                0  0.0  0.002667   
8              0.777778        0.000000                0  0.0  0.000000   
9              0.883333        0.000000                0  0.0  0.000000   
10             0.952381        0.000000                0  0.0  0.000000   
11             0.804124        0.010309                0  0.0  0.000000   
12             1.000000        0.000000                0  0.0  0.000000   
13             0.900943        0.000000                0  0.0  0.000000   
14             0.950000        0.000000                0  0.0  0.000000   
15             0.810811        0.000000                0  0.0  0.000000   
16             0.880952        0.023810                0  0.0  0.000000   
17             0.911111        0.000000                0  0.0  0.000000   
18             0.835294        0.011765                0  0.0  0.000000   
19             0.901235        0.037037                0  0.0  0.000000   

    affection  ...  distress  FleschReadingEase  user_review_count  \
0         0.0  ...       0.0            81.1310                  5   
1         0.0  ...       0.0            48.9568                 22   
2         0.0  ...       0.0            95.9393                 10   
3         0.0  ...       0.0            86.9222                  1   
4         0.0  ...       0.0            63.0018                 19   
5         0.0  ...       0.0            83.5138                192   
6         0.0  ...       0.0            79.5758                  3   
7         0.0  ...       0.0            70.6751                459   
8         0.0  ...       0.0            91.7543                 45   
9         0.0  ...       0.0            77.6478                  3   
10        0.0  ...       0.0            74.0150                246   
11        0.0  ...       0.0            63.5937                265   
12        0.0  ...       0.0            63.4338                 28   
13        0.0  ...       0.0            79.9034                 10   
14        0.0  ...       0.0           106.7450                  4   
15        0.0  ...       0.0            63.1022                 40   
16        0.0  ...       0.0            89.8948                 44   
17        0.0  ...       0.0            76.9711                  2   
18        0.0  ...       0.0            68.1763               1261   
19        0.0  ...       0.0            83.4026                  3   

    yelping_months  degree  betweenness  eigenvector  business_stars  \
0               24       0     0.000000     0.000000             3.0   
1                6       1     0.000000     0.000000             4.5   
2               14       0     0.000000     0.000000             4.5   
3                3       0     0.000000     0.000000             5.0   
4               40       0     0.000000     0.000000             3.5   
5               34      16     0.000035     0.006457             5.0   
6                8       1     0.000000     0.000004             4.5   
7               58      51     0.000007     0.024421             4.0   
8               74       0     0.000000     0.000000             5.0   
9                3       0     0.000000     0.000000             4.5   
10              80       1     0.000000     0.000000             4.0   
11              74      32     0.000133     0.017016             4.5   
12              15       0     0.000000     0.000000             2.5   
13               3       0     0.000000     0.000000             4.0   
14               5       4     0.000010     0.000377             3.0   
15              25       2     0.000001     0.000418             3.5   
16              81      12     0.000107     0.001598             4.0   
17              17       0     0.000000     0.000000             2.5   
18              76      12     0.000088     0.001898             3.5   
19               8       0     0.000000     0.000000             4.0   

    business_review_count       class  
0                      40      useful  
1                     319      useful  
2                     535  not_useful  
3                      28      useful  
4                      84      useful  
5                      19      useful  
6                     692      useful  
7                      47      useful  
8                     290      useful  
9                      20  not_useful  
10                    122  not_useful  
11                    168      useful  
12                     30  not_useful  
13                    446  not_useful  
14                    136  not_useful  
15                     31      useful  
16                    569      useful  
17                    123      useful  
18                   4967      useful  
19                     60  not_useful  

[20 rows x 26 columns]
(1000, 26)
## Provide codes

4.2. Implementation (1 point)

Now you need to create association rules between the attributes. Please refer to the Association Rule Mining exercise for more details. You will need to change the minimum support and/or other parameters to filter out less interesting rules.

Please provide the code for your association rules. If you were not able to transform the data in the previous step, you can load the preprocessed data (Yelp_Usefulness_Assignment2_3.csv) instead; no points will be deducted for using it. A starting-point sketch follows.
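
A minimal sketch with MLXtend's apriori and association_rules functions; the thresholds here are placeholders to tune, and the frame is assumed to be the discretized 1/0 data from Section 4.1:

from mlxtend.frequent_patterns import apriori, association_rules

## Drop the identifier, mine frequent itemsets, then derive rules from them
binary_data = yelp_data.drop(columns=["review_id"])
frequent_itemsets = apriori(binary_data, min_support=0.3, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules.sort_values("lift", ascending=False).head(10))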

## Provide codes

4.3. Insight and Explanation (1 point)

Provide a couple of paragraphs describing the rules you have found. Did you find any interesting rules? Pick one interesting rule and explain why it is interesting and how you used the evaluation measures. Pick one uninteresting rule and explain why it is not interesting and how the evaluation measures overrate it. Describe the lessons you learned from this exercise. Please be comprehensive in conveying what you learned. Graphs or tables may be helpful in finding meaningful patterns in your experiments. Also, you can compare the association rules with the attributes selected in Section 2.2.

## Provide your open-ended answer