Getting to Know Your Data

Data Objects and Attribute Types

Types of Data Sets

  • Data records in relational tables, data matrices
  • Vectorized text data, document term vectors
  • Graphs and networks: the Web, social networks, molecular structures
  • Spatial (map), image, and videos
  • Sequential: time series, transactions, genetic sequence data

To compute is to reckon with numbers. We deal with many different types of data: transactions and records in databases, text and human languages, graphs and network data, among others.

Ultimately, all data may need to be transformed so they are computable: codes of categories that can be compared, levels ranked in an order, real numbers, decimal values, integers, or boolean values.

Important Characteristics of Structured Data

  • Dimensionality and the Curse of dimensionality
  • Sparsity: only presence counts
  • Resolution: patterns depend on the scale
  • Distribution: centrality and dispersion

Data Objects

  • Data sets are made up of data objects.
  • A data object represents an entity.
  • Also called samples, examples, instances, data points, objects, tuples.
  • Data objects are described by attributes.
    • Rows -> data objects
    • Columns -> attributes.
            attribute 1   attribute 2   ..   attribute k
object 1
object 2
..
object n

When we talk about data, they normally appear in a tabulated or matrix format, with data objects or instances as rows, and attributes or variables as columns.

Examples:

  • sales database: customers, store items, sales
  • medical database: patients, treatments
  • university database: students, professors, courses
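
The row-and-column view described above can be sketched in a few lines of Python. The customer records below are hypothetical values made up for illustration; each dictionary stands for one data object (row) and each key for one attribute (column).

```python
# Hypothetical sales records: each dict is one data object (row),
# each key is one attribute (column).
customers = [
    {"customer_id": 1, "name": "Ann", "state": "PA"},
    {"customer_id": 2, "name": "Bob", "state": "NY"},
    {"customer_id": 3, "name": "Cho", "state": "PA"},
]

n_objects = len(customers)        # number of rows (data objects)
attributes = list(customers[0])   # column names (attributes)

# Reading one attribute across all objects gives one column of the table.
state_column = [row["state"] for row in customers]

print(n_objects, attributes, state_column)
```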

Attributes

  • Attribute (or dimensions, features, variables):
    • a data field, representing a characteristic or feature of a data object.
    • E.g., customer_ID, name, address

Data types:

  • Binary
  • Nominal
  • Ordinal
  • Numeric: quantitative
    • Interval-scaled
    • Ratio-scaled

As we employ computers to crunch data to find meaning and insight, we need to understand the basic form of data, their basic data types and implications.

The data type of a variable determines the kind of operations that can be performed on its data values. So this is something we need to pay close attention to throughout the process.

Attribute Types

  • Nominal: categories, states, or “names of things”
    • Hair_color = {auburn, black, blond, brown, grey, red, white}
    • marital status, occupation, ID numbers, zip codes
  • Binary
    • Nominal attribute with only 2 states (0 and 1)
    • Symmetric binary: both outcomes equally important, e.g., gender
    • Asymmetric binary: outcomes not equally important, e.g., medical test (positive vs. negative)

A categorical variable can have named values (labels or categories) without an order. We refer to such variables as nominal: their values come from a set of names or labels. The most basic form of a nominal variable is a binary one, where there are two possible named values: true or false, yes or no, 0 or 1, etc.

For many variables, these labels are discrete, mutually exclusive choices without an explicit relation. For example, a State variable can have values such as PA and NY, and there is no relation between PA and NY. One cannot establish a relation such as $PA>NY$ or $NY>PA$ unless another variable, such as state population, is considered. This type of variable is purely categorical, not ordinal. With a categorical variable, we can only use the equality operator to determine whether two values are equal or not, e.g. whether PA is the same as or different from NY.

Ordinal

  1. Values have a meaningful order (ranking)
  2. Interval (difference) between successive values is not known

Examples:

  • Size {small, medium, large}
  • Education {High School, College, Graduate}

Ordinal variables are in fact categorical. Their values are discrete levels that can be compared and ranked; beyond that, no mathematical operations can be performed on them. The values do not have to be numbers. For example, an Education variable with values such as high school, college, and graduate can be compared and ranked. These are named values with an order.
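
As a minimal sketch of the difference between nominal and ordinal values: nominal labels support only equality tests, while ordinal labels also support ranking. The `EDUCATION_RANK` mapping and the `is_higher` helper are names assumed here for illustration; only the order of the ranks matters, not their spacing.

```python
# Nominal: only equality is meaningful.
state_a, state_b = "PA", "NY"
same_state = (state_a == state_b)   # False: the labels differ, nothing more

# Ordinal: an explicit rank gives the labels an order.
# The integer codes are an assumption for illustration; their differences
# carry no meaning, only their order does.
EDUCATION_RANK = {"High School": 0, "College": 1, "Graduate": 2}

def is_higher(a, b):
    """Return True if education level a ranks above level b."""
    return EDUCATION_RANK[a] > EDUCATION_RANK[b]

levels = ["Graduate", "High School", "College"]
ranked = sorted(levels, key=EDUCATION_RANK.get)

print(same_state, is_higher("Graduate", "College"), ranked)
```

Note that Python will happily evaluate `"PA" > "NY"` by alphabetical order, which is exactly the kind of comparison that has no meaning for a nominal variable.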

Numeric: Interval Scaled

  1. Values have order (like ordinal)
  2. Measured on a scale of equal-sized units
  3. However, no true (well-defined) zero-point

Examples: temperature in C or F, calendar dates

An interval-scaled variable is one that is measured on a scale of equal-sized units. Values can be compared and data can be ranked in an order. In addition, one can subtract one value from another to calculate the difference.

A temperature in C or F is interval-scaled. We can say that today's temperature is 2 degrees higher than yesterday's. However, it makes little sense to add temperatures up into a total temperature.

There is no well-defined zero point here. It is in fact misleading to take the ratio between two temperature values and conclude that one is 10% lower than the other. Without a true zero value, such a claim has little scientific value.

So subtraction can be performed on an interval-scaled variable; however, division is less meaningful.
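
A small numeric sketch shows why. The two temperature readings are hypothetical. The difference between them is consistent across units (up to the unit conversion factor), but the ratio changes with the choice of scale, so it says nothing about the temperatures themselves.

```python
def c_to_f(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

def c_to_k(c):
    """Convert degrees Celsius to kelvins (a scale with a true zero)."""
    return c + 273.15

# Hypothetical readings for illustration.
today_c, yesterday_c = 20.0, 10.0

# Differences agree once the unit size is accounted for (10 C = 18 F).
diff_c = today_c - yesterday_c
diff_f = c_to_f(today_c) - c_to_f(yesterday_c)

# Ratios depend entirely on where each scale happens to put its zero.
ratio_c = today_c / yesterday_c                    # 2.0
ratio_f = c_to_f(today_c) / c_to_f(yesterday_c)    # 68 / 50 = 1.36
ratio_k = c_to_k(today_c) / c_to_k(yesterday_c)    # about 1.035

print(diff_c, diff_f, ratio_c, ratio_f, round(ratio_k, 3))
```

"Twice as warm" holds only on the Celsius scale; in kelvins, the same two readings differ by about 3.5%.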

Numeric: Ratio Scaled

  1. Values have order (like ordinal)
  2. Measured on a scale of equal-sized units
  3. Inherent (well-defined) zero-point

Example:

  • Temperature in Kelvin
  • Length, counts, monetary quantities

Now, if a variable does have a well-defined zero value, it is ratio-scaled. Data related to length, counts, and money are in general ratio-scaled. On such a variable, it is meaningful to perform subtraction, addition, division, and even multiplication.

Discrete vs. Continuous Attributes

Discrete Attribute:

  • Has only a finite or countably infinite set of values
  • Nominal values, integer coding
  • Binary attributes are a special case of discrete attributes
  • E.g., zip codes, profession, or the set of words in a collection of documents

Continuous Attribute

  • Has real numbers as attribute values
  • Practically, real values can only be measured and represented using a finite number of digits
  • Continuous attributes are typically represented as floating-point variables
  • E.g., temperature, height, or weight

Summary of Attribute Types

| Type | # Values | Order? | Interval? | Zero Point? | Remarks, allowed operations |
|---|---|---|---|---|---|
| Binary | 2 | No | No | No | 0 vs 1; Yes vs No |
| Nominal | 2 or more | No | No | No | Names and labels |
| Ordinal | 2 or more | Yes | No | No | Levels, ranks, < or > |
| Interval | $\to\infty$ | Yes | Yes | No | Difference, subtraction |
| Ratio | $\to\infty$ | Yes | Yes | Yes | Sum, ratio, division |

The table summarizes what we have discussed about the types of variables, what makes them different, and how they should be treated differently. Think about data you have encountered recently, the variables there, and which data types they should belong to. This is a useful exercise and a good starting point before getting your hands on data.

Basic Statistical Descriptions of Data

  • To better understand the data: central tendency, variation and spread
  • Data dispersion characteristics:
    • median, max, min, quantiles, outliers, variance, etc.
  • Numerical dimensions correspond to sorted intervals
  • Dispersion analysis on computed measures

Once we know the type of data we are dealing with, we can look at the basic statistics based on data samples. For numeric data, we want to look at the central tendency, variation, and spread of the distribution.

Measuring the Central Tendency

  • Mean: sum divided by the number of values
  • Median: Middle value or average of the middle two
  • Mode: the most frequent value

First, on the central tendency, we have basic statistics such as the mean, the median, and the mode. The mean is simply the arithmetic average; for a population it is often estimated from a sample. The median is the value in the very middle of the sorted data, or the average of the middle two values when the number of values is even. The mode is the most frequent value, or values.
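
The three measures can be computed with Python's standard `statistics` module. The values below are hypothetical and chosen so that the data have an even count and two modes.

```python
import statistics

# Hypothetical sample values for illustration.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(values)         # sum divided by the number of values
median = statistics.median(values)     # average of the middle two (even count)
modes = statistics.multimode(values)   # every value tied for most frequent

print(mean, median, modes)
```

Here the mean (58) is pulled above the median (54) by the single large value 110, and `multimode` reports both 52 and 70 because each occurs twice.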

Symmetric vs. Skewed Data

[Figure: a symmetric distribution, a negatively skewed distribution (left), and a positively skewed distribution (right)]

Putting these three values together gives us a rough idea about the kind of distribution in the data. When the mean, median, and mode are all about the same, the distribution is symmetric: the average sits in the middle, at the peak.

When the values do not agree with one another, the distribution can be negatively skewed (the mean is typically below the median, with a longer tail on the left), as shown in the left figure; or positively skewed (the mean is typically above the median, with a longer tail on the right), as illustrated in the right figure.
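
As a rough sketch of this idea, comparing the mean with the median gives a quick hint about the direction of skew. The function name `skew_direction` and the tolerance are assumptions for illustration, and the rule is only a heuristic: it can mislead on multimodal or very small samples.

```python
import statistics

def skew_direction(values, tol=1e-9):
    """Heuristic: compare mean and median to guess the direction of skew."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if abs(mean - median) <= tol:
        return "roughly symmetric"
    # A long right tail pulls the mean above the median, and vice versa.
    return "positively skewed" if mean > median else "negatively skewed"

# Hypothetical samples for illustration.
print(skew_direction([1, 2, 3, 4, 5]))      # mean 3, median 3
print(skew_direction([1, 2, 3, 4, 20]))     # mean 6, median 3
print(skew_direction([1, 17, 18, 19, 20]))  # mean 15, median 18
```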

Properties of Normal Distribution

The normal (distribution) curve:

  • $\mu$: mean
  • $\sigma$: standard deviation

Observation of any normal curve:

  • From $\mu-\sigma$ to $\mu+\sigma$: contains about 68% of the measurements
  • From $\mu-2\sigma$ to $\mu+2\sigma$: contains about 95% of it
  • From $\mu-3\sigma$ to $\mu+3\sigma$: contains about 99.7% of it
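
These percentages can be checked numerically. For a normal variable, the probability of falling within $k$ standard deviations of the mean equals $\mathrm{erf}(k/\sqrt{2})$, regardless of $\mu$ and $\sigma$. The helper name `coverage` is an assumption for illustration.

```python
import math

def coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")
```

The results round to 68.27%, 95.45%, and 99.73%, matching the approximate figures above.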

Measuring the Dispersion of Data

  • Quartiles:
    1. $Q_1$ is the 25th percentile
    2. Median is the 50th percentile, or $Q_2$
    3. $Q_3$ is the 75th percentile
  • Inter-quartile range: IQR = $Q_3 - Q_1$
  • Five-number summary of a distribution
    • [Minimum, Q1, Median, Q3, Maximum]

Besides the mean and median, quartiles are useful for measuring the dispersion of data.

Q1 is the 25th percentile, Q2 is the 50th percentile or the median, and Q3 the 75th percentile.

Adding the minimum and maximum values to these three, we have the five number summary of a distribution.

  • Variance: the mean squared deviation from the mean
  • Standard deviation: the square root of variance
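
The dispersion measures above can be sketched with the standard `statistics` module. The data are hypothetical. Note that several conventions exist for computing quartiles; the `method="inclusive"` option used here is one of them, and other tools may give slightly different values for Q1 and Q3.

```python
import statistics

# Hypothetical sample values for illustration.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

five_number = (min(values), q1, q2, q3, max(values))

# Variance as the mean squared deviation from the mean (population form),
# and the sample form, which divides by n - 1 instead of n.
pop_var = statistics.pvariance(values)
sample_var = statistics.variance(values)
pop_std = statistics.pstdev(values)   # square root of the population variance

print(five_number, iqr)
print(round(pop_var, 3), sample_var, round(pop_std, 3))
```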

Graphic Displays of Basic Statistical Descriptions

  1. Boxplot: graphic display of five-number summary
  2. Histogram: x-axis shows values, y-axis represents frequencies
  3. Quantile plot: values vs. quantiles
  4. Quantile-quantile (q-q) plot: compare two distributions and their quantiles
  5. Scatter plot: putting pairs (two variables) of values together

Graphic presentations of data distributions are often intuitive and helpful. Let's look at some of these tools: boxplot, histogram, quantile plot, QQ plot, and scatter plot.

Boxplot Analysis

  • Five-number summary of a distribution
  • [Minimum, Q1, Median, Q3, Maximum]
  • Boxplot
    • Usually, a vertical representation
    • Here, a horizontal plot
      • following earlier distribution plots (on X)

A boxplot is a visual representation of the five-number summary. As shown in the example, it depicts the median, the lower quartile Q1, and the upper quartile Q3, in the context of all possible values between the minimum and maximum.

[Figure: example boxplot]

Often, boxplots are presented vertically as distributions on the Y axis and X can be used to compare different levels of a predictor variable.

  • Data is represented with a box
  • The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
  • The median is marked by a line within the box
  • Whiskers: two lines outside the box extended to Minimum and Maximum, or to the most extreme values within the outlier threshold
  • Outliers:
    • Points beyond a specified outlier threshold, plotted individually
    • E.g. values more than $1.5 \times IQR$ below $Q_1$ or above $Q_3$

Where the whiskers (lines outside the box) terminate depends on the bounds Q1 - 1.5 IQR and Q3 + 1.5 IQR (or some variation of that).

The whiskers stay within those bounds but do not necessarily reach them. They are drawn to the most extreme observations that fall WITHIN the 1.5 IQR bounds.
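
The whisker rule can be sketched as follows. The function name `boxplot_stats`, the sample data, and the choice of the `"inclusive"` quartile method are assumptions for illustration; plotting libraries may use slightly different quartile conventions.

```python
import statistics

def boxplot_stats(values, k=1.5):
    """Compute box, whisker ends, and outliers using the k * IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_bound, high_bound = q1 - k * iqr, q3 + k * iqr
    # Whiskers end at the most extreme observations inside the bounds,
    # not at the bounds themselves.
    inside = [x for x in data if low_bound <= x <= high_bound]
    outliers = [x for x in data if x < low_bound or x > high_bound]
    return {
        "q1": q1, "median": median, "q3": q3,
        "bounds": (low_bound, high_bound),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }

# Hypothetical data with one unusually large value.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(stats)
```

In this example the upper bound is 13, but the upper whisker stops at 8, the largest observation inside the bound, and 30 is plotted individually as an outlier.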

Histogram Analysis

Histogram

  • Graph display of tabulated frequencies, shown as bars
  • Proportion of cases in each range or category

Two Histograms

Note the equal intervals on the histogram here: with equal bar widths, bar height is proportional to bar area.

A histogram is a graphic display of frequencies, in a number of ranges or bins.

Because of the bar representation, a histogram is very similar, and sometimes equivalent, to a bar chart. The histogram here can indeed be read as a bar chart, because the ranges have equal widths: each bar covers the same 20-dollar range.

If the ranges are not uniform, however, the histogram cannot be read as a bar chart, because what matters in a histogram is the area of each bar, not its height. With non-uniform ranges, height is no longer proportional to area.
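
A short sketch makes the height-versus-area point concrete. With unequal bin widths, each bar's height is taken as the density (the proportion of cases divided by the bin width), so that the bar's area equals the proportion of cases in the bin. The function name `histogram`, the prices, and the bin edges are hypothetical.

```python
def histogram(values, edges):
    """Count values per bin and convert counts to densities (area = proportion)."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            is_last = (i == n_bins - 1)
            # Bins are [left, right), except the last, which includes its right edge.
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    densities = [c / (total * (edges[i + 1] - edges[i]))
                 for i, c in enumerate(counts)]
    return counts, densities

# Hypothetical unit prices and deliberately unequal bins: widths 20, 20, 60.
prices = [5, 12, 18, 25, 33, 38, 45, 52, 70, 95]
edges = [0, 20, 40, 100]
counts, densities = histogram(prices, edges)
areas = [d * (edges[i + 1] - edges[i]) for i, d in enumerate(densities)]

print(counts)
print(densities)
print(areas)
```

The last bin holds the most cases (4 of 10) yet has the lowest bar, because its 40% share is spread over a range three times as wide.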

Histograms and Boxplots

Two Histograms

  • Two different histograms
  • Same boxplots (quartiles)
  • Histograms tell more about the distributions

Quantile Plot

  • Displays all of the data, $x_i$ in increasing order
  • Plots quantile information: $f_i$ is the approximate fraction of the data at or below $x_i$
  • Assess both the overall behavior and unusual occurrences

When you take a standard test, you may receive a report about your percentile -- the proportion of others with a grade below yours. The quantile here is roughly the same idea. Now that the data values are plotted against the f-values, or quantiles, you can observe the overall pattern and compare it to other data.

From the quantiles, you can also identify the quartiles: Q1, the median, and Q3. Again, you can assess and compare these values.
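
The points of a quantile plot can be computed as below. The formula $f_i = (i - 0.5)/n$ is one common convention and is an assumption here, since the text does not fix a formula; the function name and the data are also hypothetical.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs: sorted values paired with their f-values."""
    data = sorted(values)
    n = len(data)
    # Assumed convention: f_i = (i - 0.5) / n for the i-th smallest value.
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]

points = quantile_plot_points([40, 10, 30, 20])
for f, x in points:
    print(f, x)
```

Plotting `x` against `f` gives the quantile plot; the values near f = 0.25, 0.5, and 0.75 approximate Q1, the median, and Q3.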

Quantile-Quantile (Q-Q) Plot

QQ plot:

  • Graphs quantiles of one variable vs. those of another
  • Highlights the quartiles: $Q_1$, Median, and $Q_3$
  • One variable vs. another
  • Distribution model (assumption) vs. a variable (real data)

When comparing two data distributions, quantile-quantile or Q-Q plots are especially useful. A QQ plot graphs the quantiles of one univariate distribution against the corresponding quantiles of another.

The example shows the distribution of unit price at one branch vs. that in another branch. The quartiles Q1, Median and Q3 are all above the diagonal line, indicating Y (branch 2) tends to have a higher unit price than X (branch 1) does.

A QQ plot is also useful in situations where you expect a specific distribution model for your data. For example, linear regression is commonly used under the assumption of normally distributed errors, so it is worth examining a QQ plot of the observed distribution (typically of the residuals) against the ideal normal distribution. Certain remedies may be required if the distribution is not normal.
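
Both uses can be sketched with the standard library. The branch prices are hypothetical, the two-sample version assumes equal sample sizes so that sorted values can be paired directly, and the plotting positions $(i - 0.5)/n$ for the normal comparison are an assumed convention.

```python
from statistics import NormalDist

def qq_points(x, y):
    """Pair the sorted values of two equal-sized samples, quantile by quantile."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    return list(zip(sorted(x), sorted(y)))

def normal_qq_points(values):
    """Pair standard normal quantiles with the sorted data values."""
    data = sorted(values)
    n = len(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(data, start=1)]

# Hypothetical unit prices at two branches.
branch1 = [40, 43, 47, 52, 60, 74]
branch2 = [42, 46, 51, 58, 66, 80]

points = qq_points(branch1, branch2)
above_diagonal = sum(1 for x, y in points if y > x)
print(above_diagonal, "of", len(points), "points lie above the line y = x")

normal_points = normal_qq_points(branch1)
print(normal_points)
```

Points above the diagonal mean the second sample has the larger value at that quantile. In the normal version, points that fall close to a straight line suggest the data are approximately normal.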

Scatter plot

  • Provides a first look at bivariate data
  • Initial exploration and examination
    • Potential clusters of data points
    • Are they outliers?
    • Are the two variables correlated? In what ways?

With multiple variables, scatter plots are useful for initial data exploration and may reveal relations between variables, potential data clusters, and outliers.

[Figure: example scatter plots]

The example scatter plots show situations where variables are positively or negatively correlated. In other data, the correlation might be more complex, or perhaps no correlation can be observed at all.
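
What a scatter plot suggests visually can be checked with a number. The sketch below uses the Pearson correlation coefficient, which is one common measure of linear correlation and an assumption here, since the text does not name a specific measure. The function name and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4]
r_positive = pearson_r(x, [2, 4, 6, 8])    # y rises with x
r_negative = pearson_r(x, [8, 6, 4, 2])    # y falls as x rises
r_none = pearson_r(x, [1, -1, -1, 1])      # no linear relation

print(r_positive, r_negative, r_none)
```

A value near +1 or -1 indicates a strong positive or negative linear relation, and a value near 0 indicates no linear relation. The last example has a clear U-shaped pattern despite r = 0, which is why looking at the scatter plot itself still matters.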
