Demo: Text Vectorization

Objectives

In this exercise, we process a small collection of course titles (treated as text documents) and compute the following statistics:

  1. Term Frequencies (TF) of terms in each document.
  2. Document Frequencies (DF) of terms across the collection, from which Inverse Document Frequencies (IDF) can be derived (see the definitions after this list).
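
These quantities are commonly defined as follows (the demo does not fix an exact IDF formula; the log-based variant shown here is one common choice):

TF(t, d) = number of times term t occurs in document d
DF(t)    = number of documents in the collection that contain term t
IDF(t)   = log(N / DF(t)), where N is the total number of documents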

1. Data

We have a small collection of 3 course titles:

1 Information and System
2 Data and Information
3 System and System Programming

2. Create a Course class

class Course: 
    def __init__(self, course_id):
        # create a new course with an ID
        self.id = course_id
        # create an empty dictionary 
        # to hold term frequency (TF) counts
        self.tfs = {}
    
    def count_words(self, text):
        # split a title into words, 
        # using space " " as delimiter
        words = text.lower().split(" ")
        for word in words: 
            # for each word in the list
            if word in self.tfs: 
                # if it has been counted in the TF dictionary
                # add 1 to the count
                self.tfs[word] = self.tfs[word] + 1
            else:
                # if it has not been counted, 
                # initialize its TF with 1
                self.tfs[word] = 1

The rest of the code in this demo is written outside the Course class.

3. Create a print_dictionary function

def print_dictionary(dict_data):
    # print all key-value pairs in a dictionary
    for key in dict_data: 
        print(key, dict_data[key])
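
As a quick check (not part of the original demo; the course ID 99 below is arbitrary), the class and the helper function can be exercised together on one of the titles from the data above:

course_demo = Course(99)
course_demo.count_words("System and System Programming")
print_dictionary(course_demo.tfs)

Expected output (dictionaries preserve insertion order in Python 3.7+):

system 2
and 1
programming 1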

4. Steps to Vectorize the Course Titles

4.1 Create an empty list of courses

courses = []

4.2 Create a list of the 3 course titles

titles = ["Information and System", "Data and Information", "System and System Programming"]

4.3 Use a loop to process data

  1. Create a Course object for each title;
  2. Process the title and compute its TF counts;
  3. Add the Course object (with its TF statistics) to the courses list.

for i in range(0, len(titles)):
    title = titles[i]
    
    # create a new course with an ID
    course = Course(i+1)
    
    # process title and compute term frequencies (TF)
    course.count_words(title)
    
    # add the course to the list
    courses.append(course)

4.4 Create an empty dictionary to keep track of DF values

dfs = {}

4.5 Go through the Courses' TF dictionaries to get DF values

for course in courses: 
    for word in course.tfs: 
        # each word appears only once as a key in a course's TF dictionary,
        # so each course adds at most 1 to that word's DF count
        dfs[word] = dfs.get(word,0) + 1

4.6 Print DF (Document Frequency) Statistics

print("DF statistics: ")
print("==============")
print_dictionary(dfs)
DF statistics: 
==============
information 2
and 3
system 2
data 1
programming 1

4.7 Print TF (Term Frequency) Statistics

print("TF statistics: ")
print("==============")
for course in courses: 
    print("COURSE #{0}".format(course.id))
    print("-------------")
    print_dictionary(course.tfs)
    print("")
TF statistics: 
==============
COURSE #1
-------------
information 1
and 1
system 1

COURSE #2
-------------
data 1
and 1
information 1

COURSE #3
-------------
system 2
and 1
programming 1
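
5. Deriving IDF from DF

The objectives mention IDF, which the steps above stop short of computing. A minimal sketch, assuming the common log(N / DF) variant (other definitions exist), builds directly on the dfs dictionary and courses list created above:

import math

n_docs = len(courses)
idfs = {}
for word in dfs:
    # IDF(t) = log(N / DF(t)); every word in dfs has DF >= 1, so the division is safe
    idfs[word] = math.log(n_docs / dfs[word])

print("IDF statistics: ")
print("==============")
print_dictionary(idfs)

With this variant, "and" (which appears in all 3 titles) gets an IDF of 0, while "data" and "programming" get log(3) ≈ 1.10.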