Demo: Text Vectorization

Objectives

In this exercise, we process a small collection of course titles (treated as text documents) and compute the following statistics:

  1. Term Frequencies (TF) of terms in each document.
  2. Document Frequencies (DF) of terms across the collection, from which Inverse Document Frequencies (IDF) can be derived (see the definitions after this list).
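
These quantities are commonly defined as follows (the demo does not fix an exact IDF formula; the log-based variant shown here is one common choice):

TF(t, d) = number of times term t occurs in document d
DF(t)    = number of documents in the collection that contain term t
IDF(t)   = log(N / DF(t)), where N is the total number of documents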

1. Data

We have a small collection of 3 course titles:

1 Information and System
2 Data and Information
3 System and System Programming

2. Create a Course class

class Course: 
    def __init__(self, course_id):
        # create a new course with an ID
        self.id = course_id
        # create an empty dictionary 
        # to hold term frequency (TF) counts
        self.tfs = {}
    
    def count_words(self, text):
        # split a title into words, 
        # using space " " as delimiter
        words = text.lower().split(" ")
        for word in words: 
            # for each word in the list
            if word in self.tfs: 
                # if it has been counted in the TF dictionary
                # add 1 to the count
                self.tfs[word] = self.tfs[word] + 1
            else:
                # if it has not been counted, 
                # initialize its TF with 1
                self.tfs[word] = 1

The rest of the code in this demo is written outside the Course class.

3. Create a print_dictionary function

def print_dictionary(dict_data):
    # print all key-value pairs in a dictionary
    for key in dict_data: 
        print(key, dict_data[key])
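
As a quick check (not part of the original demo; the course ID 99 below is arbitrary), the class and the helper function can be exercised together on one of the titles from the data above:

course_demo = Course(99)
course_demo.count_words("System and System Programming")
print_dictionary(course_demo.tfs)

Expected output (dictionaries preserve insertion order in Python 3.7+):

system 2
and 1
programming 1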

4. Steps to Vectorize the Course Titles

4.1 Create an empty list of courses

courses = []

4.2 Create a list of the 3 course titles

titles = ["Information and System", "Data and Information", "System and System Programming"]

4.3 Use a loop to process data

  1. Create a Course object for each title;
  2. Process the title and compute its TF counts;
  3. Add the Course object (with its TF statistics) to the courses list.

for i in range(0, len(titles)):
    title = titles[i]
    
    # create a new course with an ID
    course = Course(i+1)
    
    # process title and compute term frequencies (TF)
    course.count_words(title)
    
    # add the course to the list
    courses.append(course)

4.4 Create an empty dictionary to keep track of DF values

dfs = {}

4.5 Go through the Courses' TF dictionaries to get DF values

for course in courses: 
    for word in course.tfs: 
        # each word appears only once as a key in a course's TF dictionary,
        # so each course adds at most 1 to that word's DF count
        dfs[word] = dfs.get(word,0) + 1

4.6 Print DF (Document Frequency) Statistics

print("DF statistics: ")
print("==============")
print_dictionary(dfs)
DF statistics: 
==============
information 2
and 3
system 2
data 1
programming 1

4.7 Print TF (Term Frequency) Statistics

print("TF statistics: ")
print("==============")
for course in courses: 
    print("COURSE #{0}".format(course.id))
    print("-------------")
    print_dictionary(course.tfs)
    print("")
TF statistics: 
==============
COURSE #1
-------------
information 1
and 1
system 1

COURSE #2
-------------
data 1
and 1
information 1

COURSE #3
-------------
system 2
and 1
programming 1
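
5. Deriving IDF from DF

The objectives mention IDF, which the steps above stop short of computing. A minimal sketch, assuming the common log(N / DF) variant (other definitions exist), builds directly on the dfs dictionary and courses list created above:

import math

n_docs = len(courses)
idfs = {}
for word in dfs:
    # IDF(t) = log(N / DF(t)); every word in dfs has DF >= 1, so the division is safe
    idfs[word] = math.log(n_docs / dfs[word])

print("IDF statistics: ")
print("==============")
print_dictionary(idfs)

With this variant, "and" (which appears in all 3 titles) gets an IDF of 0, while "data" and "programming" get log(3) ≈ 1.10.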