class Document:
    """
    The Document class to be implemented.
    Please define all required methods
    within the class structure.
    """
    def __init__(self, doc_id):
        """to be implemented"""

    def tokenize(self, text):
        """to be implemented"""
The Document class should have:

An __init__(self, doc_id) method as the constructor that:
- Assigns doc_id to self.id;
- Creates an empty dictionary variable with self.variable_name (a name of your choice). Later you will use this dictionary to keep track of all unique words and their frequencies in the document.

A tokenize(text) method that (see the sketch below):
- Splits text into single words using space and punctuation as delimiters;
- Uses a loop to go through all the words, and for each word:
  - Converts the word to lower case using the lower() method of a string;
  - If it does not appear in the dictionary, adds it to the dictionary and sets its count/frequency to 1;
  - If it is already in the dictionary, increments its count/frequency by adding 1 to it.

Hint: See the lecture on data structures and follow the radish vote counting example.
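A minimal sketch of how this class could look, assuming the frequency dictionary is named word_freq (the attribute name is up to you) and that splitting on "space and punctuation" is done with a regular expression:

import re

class Document:
    """Sketch of the Document class described above."""

    def __init__(self, doc_id):
        self.id = doc_id       # assign doc_id to self.id
        self.word_freq = {}    # empty dictionary for unique words and their frequencies

    def tokenize(self, text):
        # Treat any run of non-alphanumeric characters (spaces, punctuation) as a delimiter.
        for word in re.split(r"[^A-Za-z0-9]+", text):
            if not word:                     # skip empty strings produced by the split
                continue
            word = word.lower()              # convert the word to lower case
            if word not in self.word_freq:
                self.word_freq[word] = 1     # first occurrence: count of 1
            else:
                self.word_freq[word] += 1    # already seen: add 1 to its count

For example, after doc = Document(1) and doc.tokenize("A radish, a vote."), doc.word_freq would hold {"a": 2, "radish": 1, "vote": 1}.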
def save_dictionary(dict_data, file_path_name):
    """Saves dictionary data to a text file.
    Arguments:
        dict_data: a dictionary with key-value pairs
        file_path_name: the full path name of a text file (output)
    """
The function should accept two arguments:
- One argument for the dictionary with the data to be saved;
- A second argument for the path name of the text file where the data should be saved.
The function saves all data/statistics in the dictionary to a text file, with each key-value pair on one text line, the key and value separated by a tab ("\t"). A sketch of the function follows the example output below.
The output file should look like:
Key1    value1
Key2    value2
Key3    value3
...     ...
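A minimal sketch of this function, assuming UTF-8 as the output encoding (any reasonable text encoding works):

def save_dictionary(dict_data, file_path_name):
    """Write each key-value pair on its own line, key and value separated by a tab."""
    with open(file_path_name, "w", encoding="utf-8") as out_file:
        for key, value in dict_data.items():
            out_file.write(f"{key}\t{value}\n")

For example, save_dictionary({"radish": 3, "vote": 1}, "tf_1.txt") writes two tab-separated lines to tf_1.txt.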
def vectorize(data_path):
    """Vectorizes text files in the path.
    Arguments:
        data_path: the path to the directory (folder) where text files are
    """
The function should:
- Take a string argument as the path (folder) to where the text data files are;
- Process all data files in the path and produce TF and DF statistics.
Here are the steps in the function (a sketch follows these steps):
Create a dictionary variable (this is a global dictionary variable outside the class) to keep track of all unique words and their DF (document frequency) values.
Create a loop to go through every .txt file in the path argument. File names should be 1.txt, 2.txt, ..., 20.txt. For each .txt file (within the loop):
- Create a Document object (based on the Document class) using the filename as the doc_id parameter. For example, doc_id should be 2 for 2.txt.
- Read the content (text lines) from the text file.
- Call the document object's .tokenize() function to process the text content.
- Call the save_dictionary() function to save the document's dictionary with TF (term frequencies) to a file named tf_DOCID.txt in the same path. The TF dictionary can be accessed via the document object's .dictionary_variable_name. For example, after processing the 1.txt file, the data should be saved to the tf_1.txt file in the same directory.
- Create a nested loop (a smaller/indented loop inside the above loop), and for each word in the document's dictionary:
  - If it does not appear in the DF dictionary (the dictionary variable outside the class), add the word to the DF dictionary with a value of 1;
  - If it is already in the DF dictionary, increment its DF value by adding 1 to it.
After all files are processed (OUTSIDE the BIG LOOP above), call the save_dictionary() function again to save the DF dictionary to a file named df.txt in the same path as the input text files.
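A minimal sketch of these steps, assuming the Document sketch above (with its word_freq attribute) and the save_dictionary() sketch, and with the DF dictionary created inside the function rather than at module level for self-containment:

import os

def vectorize(data_path):
    """Sketch of the vectorize() steps described above."""
    df = {}  # tracks each unique word and its document frequency

    # Loop over the files 1.txt, 2.txt, ..., 20.txt in data_path.
    for doc_id in range(1, 21):
        with open(os.path.join(data_path, f"{doc_id}.txt"), "r", encoding="utf-8") as in_file:
            text = in_file.read()

        doc = Document(doc_id)   # e.g. doc_id is 2 for 2.txt
        doc.tokenize(text)       # fills the document's TF dictionary

        # Save the TF dictionary to tf_DOCID.txt in the same path.
        save_dictionary(doc.word_freq, os.path.join(data_path, f"tf_{doc_id}.txt"))

        # Nested loop: each word counts once per document toward DF.
        for word in doc.word_freq:
            if word not in df:
                df[word] = 1
            else:
                df[word] += 1

    # Outside the big loop: save the DF dictionary to df.txt in the same path.
    save_dictionary(df, os.path.join(data_path, "df.txt"))

Calling vectorize("data") would then produce tf_1.txt through tf_20.txt and df.txt inside the data folder (the folder name here is only an example).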