import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from numpy import linalg
from scipy import stats
from scipy.stats import norm, probplot
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
%matplotlib inline
age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
(a) What is the mean of the data? What is the median?
# Use Numpy's np.mean(age)
# Use Numpy's np.median(age)
(b) What is the mode of the data?
# Use Scipy's stats.mode(age)
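Parts (a) and (b) can be sketched as follows. Note that `stats.mode` reports only the smallest of tied modes; in this data both 25 and 35 occur four times, which matters for the modality question in (c).

```python
import numpy as np
from scipy import stats

age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = np.mean(age)      # arithmetic average
median = np.median(age)  # middle value of the sorted data (n = 27, so the 14th value)

# stats.mode returns an array in older SciPy and a scalar in newer SciPy;
# np.atleast_1d handles both. Only the smallest tied mode (25) is reported,
# even though 35 also occurs four times.
mode = int(np.atleast_1d(stats.mode(age).mode)[0])

print(mean, median, mode)
```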
(c) Plot the distribution and comment on the data’s modality (e.g., unimodal, bimodal, or trimodal).
# Use Matplotlib's plt.hist(age)
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
# Try: np.percentile(age, [25, 75])
(e) Give the five-number summary of the data.
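For (d) and (e), a quick sketch using the same age list:

```python
import numpy as np

age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

# Five-number summary: min, Q1, median, Q3, max.
# np.percentile interpolates linearly by default, so Q1/Q3 may differ
# slightly from textbook quartile conventions (which give Q1 = 20 here).
q0, q1, q2, q3, q4 = np.percentile(age, [0, 25, 50, 75, 100])
print(f"min={q0}, Q1={q1}, median={q2}, Q3={q3}, max={q4}")
```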
(f) Show a boxplot of the data.
# Try: plt.boxplot(age)
(g) What does the boxplot tell? Be thorough.
2. IRIS Data Statistics
Consult the IRIS dataset documentation online:
http://archive.ics.uci.edu/ml/datasets/Iris
The dataset is also bundled with related packages such as Sklearn and Seaborn.
Here is an example of loading the data into a data frame.
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])
iris_df.head()
Questions
- Does it look like a normal distribution?
- What modality is the distribution? (# modes)
- Is the distribution skewed? If so, positively or negatively?
Repeat for Each Attribute
- Repeat the above for each numeric attribute, and answer the questions.
- Pick a distribution that is unlikely a normal distribution
- Generate a QQ plot on the actual quantiles vs. normal distribution quantiles
- Is it a normal distribution according to QQ? Why or why not?
Example code to generate a QQ plot against normal:
probplot(iris_df["sepal length (cm)"], dist="norm", plot=plt)
2.3. Subset Data Distributions (2 points)
Now take a data subset for a specific target (species). For example:
iris_df[iris_df["target"]==0.0]
will limit the data to rows with target value 0.0.
Produce and examine each numeric attribute's data distribution for each target.
Example code to plot a subset distribution (sns.histplot with kde=True replaces the deprecated sns.distplot):
sns.histplot(iris_df[iris_df["target"]==0.0]["sepal length (cm)"], kde=True)
Questions
- Does it look like a normal distribution?
- What modality is the distribution? (# modes)
- Is the distribution symmetric or skewed? If so, positively or negatively?
Repeat for Each Target
- Repeat the above for each numeric attribute and each target level (0, 1, 2).
- Pick a distribution that looks like a normal distribution.
- Generate a QQ plot on the actual quantiles vs. normal distribution quantiles
- Is it a normal distribution according to QQ? Why or why not?
Example code to generate a QQ plot against the normal distribution:
probplot(iris_df[iris_df["target"]==1.0]["petal length (cm)"], dist="norm", plot=plt)
For Each Attribute
Examine the boxplots and discuss the following questions:
- Do the distributions show a constant variance across the different target levels?
- At each level, is the distribution symmetric, or is it positively or negatively skewed?
- Are the means constant or different across levels? What does that suggest about a potential correlation with the target?
Please do the boxplots and answer the question for each numeric attribute.
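One way to sketch the per-target boxplots for all four numeric attributes at once (assuming the iris_df frame built earlier), along with the group means for the "constant means?" question:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

# One subplot per numeric attribute, with one box per target level
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, col in zip(axes.ravel(), iris['feature_names']):
    sns.boxplot(x="target", y=col, data=iris_df, ax=ax)
fig.tight_layout()

# Mean of each attribute within each target level
group_means = iris_df.groupby("target").mean()
print(group_means.round(3))
```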
Optional: You may also consider using the hue parameter to color code the target (species) on the plots:
sns.pairplot(sns.load_dataset("iris"), hue="species");
(Note: Seaborn's bundled copy of the dataset is used here because it has a "species" column; the sklearn Bunch object does not.)
3.2. Visual Examination (1 point)
Examine the scatterplots and identify two pairs of numeric attributes:
- One pair of strongly correlated attributes; briefly explain whether they are positively or negatively correlated, and why.
- One pair of weakly correlated or uncorrelated attributes; Briefly explain why they appear to be uncorrelated.
For the two pairs of attributes you identified earlier, single out their Pearson correlation scores:
- Strongly correlated pair: Pearson score
- Uncorrelated pair: Pearson score
Note: please keep 3 digits after the decimal point, e.g. 0.123.
What do the scores tell you? Are they consistent with your visual observation?
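One way to pull out the Pearson scores with scipy.stats.pearsonr (the specific pairs below, petal length vs. petal width and sepal length vs. sepal width, are only illustrative guesses; substitute the pairs you actually identified):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

# pearsonr returns (correlation, p-value); keep 3 decimals as requested
r_strong, _ = pearsonr(iris_df["petal length (cm)"], iris_df["petal width (cm)"])
r_weak, _ = pearsonr(iris_df["sepal length (cm)"], iris_df["sepal width (cm)"])
print(round(r_strong, 3), round(r_weak, 3))
```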
3.6. Cosine 2 with Vectors from Mean (0.25 point)
Subtract the mean from each attribute so that each column becomes a vector measured from its mean (instead of from the origin), then compute the cosine similarities. For example:
a = iris_df['petal length (cm)']
a2 = a - np.mean(a)
b = iris_df['petal width (cm)']
b2 = b - np.mean(b)
c2 = np.dot(a2, b2) / (linalg.norm(a2) * linalg.norm(b2))
c2
3.7. Summary and Observation (1 point)
Compile all scores (Pearson, Euclidean distance, Cosine 1, and Cosine 2) together in one table below:
| Pair | Pearson | Euclidean Dist | Cosine 1 | Cosine 2 |
|---|---|---|---|---|
| Strong correlation: attr 1 vs. attr 2 | | | | |
| Weak correlation: attr 3 vs. attr 4 | | | | |
Examine the scores for each pair in the table, and answer the questions:
- Does a stronger Pearson correlation necessarily lead to a smaller or greater Euclidean distance?
- Does a strong Pearson correlation lead to a greater Cosine 1 (vectors from the origin)?
- What do you observe is the relation between Pearson correlation and Cosine 2 (vectors from the mean)?
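A sketch that computes all four scores for one pair (shown for petal length vs. petal width; substitute your own pairs to fill in the table). The helper function pair_scores is a hypothetical name introduced here for convenience:

```python
import numpy as np
import pandas as pd
from numpy import linalg
from scipy.stats import pearsonr
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

def pair_scores(x, y):
    """Pearson, Euclidean distance, Cosine 1 (from origin), Cosine 2 (from mean)."""
    pearson = pearsonr(x, y)[0]
    euclid = linalg.norm(x - y)
    cos1 = np.dot(x, y) / (linalg.norm(x) * linalg.norm(y))
    xc, yc = x - np.mean(x), y - np.mean(y)
    cos2 = np.dot(xc, yc) / (linalg.norm(xc) * linalg.norm(yc))
    return pearson, euclid, cos1, cos2

x = iris_df["petal length (cm)"].to_numpy()
y = iris_df["petal width (cm)"].to_numpy()
print([round(s, 3) for s in pair_scores(x, y)])
```

Comparing the Pearson and Cosine 2 columns numerically should directly answer the last question above.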