Search
Python NoSQL Processing

MongoDB Python

Data

In this tutorial, we are going to use a data set about university statistics: https://www.kaggle.com/theriley106/university-statistics.

A copy of the data can be downloaded from here:

schoolInfo.json.

The data are in the JSON format. An example is shown here:

{
    "rankingNoteText":NULL
    "nonResponderText":NULL
    "nonResponder":boolfalse
    "act-avg":int32
    "primaryPhoto":string"https://www.usnews.com/img/college-photo_31291.jpg"
    "primaryPhotoThumb":string"https://www.usnews.com/img/college-photo_31291_170x170mm.jpg"
    "sat-avg":int1400
    "enrollment":int5400
    "city":string"Princeton"
    "sortName":string"princetonuniversity"
    "zip":string"08544"
    "rankingNoteCharacter":NULL
    "acceptance-rate":int7
    "rankingDisplayScore":string"100"
    "percent-receiving-aid":int60
    "cost-after-aid":int16793
    "state":string"NJ"
    "rankingSortRank":int1
    "hs-gpa-avg":float3.9
    "urlName":string"princeton-university"
    "rankingDisplayName":string"National Universities"
    "rankingDisplayRank":string"#1"
    "ranking":string"overall"
    "xwalkId":string"186131"
    "rankingIsTied":boolfalse
    "isPublic":booltrue
    "businessRepScore":NULL
    "tuition":int47140
    "engineeringRepScore":float4.1
    "displayName":string"Princeton University"
    "schoolType":string"national-universities"
    "region":NULL
    "aliasNames":NULL
    "rankingType":string"national-universities"
    "overallRank":int1
    "institutionalControl":string"private"
    "rankingMaxPossibleScore":NULL
    "rankingRankStatus":string"ranked"
    "primaryKey":string"2627"
}

PyMongo

With Python:

  • PyMongo module to access MongoDB
  • json module to process data
import json
from pprint import pprint
from pymongo import MongoClient

Connect to MongoDB with PyMongo:

  1. Connect to MongoDB
  2. Select the education database
  3. Select the schools collection
con = MongoClient()
db = con.education
schools = db.schools

Note when MongoClient() is called, it connect to the MongoDB server locally on the default port. You can add parameters to connect to a different server such as ((not a real server address here):

con = pymongo.MongoClient("mongodb://example.com:27017/")

db references the education database and schools is the data collection in the database.

Delete all existing documents if any

response = schools.delete_many( {} )
count = response.deleted_count
print("{0} documents have been deleted.".format(count))
0 documents have been deleted.

Load data from JSON file

with open("data/schoolInfo.json") as f:
    count = 0
    data = json.loads(f.read())
    for school in data:
        # print(school)
        schools.insert_one(school)
        count += 1
    print("{0} documents have been inserted.".format(count))
311 documents have been inserted.

Find One

Find the first doucment in the collection:

u = schools.find_one()
pprint(u)
{'_id': ObjectId('5eb9eb5eaac3d1b1d9457617'),
 'acceptance-rate': 7,
 'act-avg': 32,
 'aliasNames': None,
 'businessRepScore': None,
 'city': 'Princeton',
 'cost-after-aid': 16793.0,
 'displayName': 'Princeton University',
 'engineeringRepScore': 4.1,
 'enrollment': 5400,
 'hs-gpa-avg': 3.9,
 'institutionalControl': 'private',
 'isPublic': True,
 'nonResponder': False,
 'nonResponderText': None,
 'overallRank': 1,
 'percent-receiving-aid': 60,
 'primaryKey': '2627',
 'primaryPhoto': 'https://www.usnews.com/img/college-photo_31291.jpg',
 'primaryPhotoThumb': 'https://www.usnews.com/img/college-photo_31291_170x170mm.jpg',
 'ranking': 'overall',
 'rankingDisplayName': 'National Universities',
 'rankingDisplayRank': '#1',
 'rankingDisplayScore': '100',
 'rankingIsTied': False,
 'rankingMaxPossibleScore': None,
 'rankingNoteCharacter': None,
 'rankingNoteText': None,
 'rankingRankStatus': 'ranked',
 'rankingSortRank': 1,
 'rankingType': 'national-universities',
 'region': None,
 'sat-avg': 1400,
 'schoolType': 'national-universities',
 'sortName': 'princetonuniversity',
 'state': 'NJ',
 'tuition': 47140,
 'urlName': 'princeton-university',
 'xwalkId': '186131',
 'zip': '08544'}

Find All

Find all documents (schools) in the collection and show their names:

docs = schools.find().limit(25)  # limit() to avoid too much output
for doc in docs: 
    print(doc["displayName"], end=", ")
Princeton University, Harvard University, University of Chicago, Yale University, Columbia University, Massachusetts Institute of Technology, Stanford University, University of Pennsylvania, Duke University, California Institute of Technology, Dartmouth College, Johns Hopkins University, Northwestern University, Brown University, Cornell University, Rice University, Vanderbilt University, University of Notre Dame, Washington University in St. Louis, Georgetown University, Emory University, University of California--Berkeley, University of California--Los Angeles, University of Southern California, Carnegie Mellon University, 

Find with query

Find schools in Philadelphia:

query = { "state":"PA", "city":"Philadelphia" }
pa_schools = schools.find(query)
for school in pa_schools: 
    print(school)
{'_id': ObjectId('5eb9eb5eaac3d1b1d945761e'), 'rankingNoteText': None, 'nonResponderText': None, 'nonResponder': False, 'act-avg': 32, 'primaryPhoto': 'https://www.usnews.com/img/college-photo_2743.jpg', 'primaryPhotoThumb': 'https://www.usnews.com/img/college-photo_2743_170x170mm.jpg', 'sat-avg': 1380, 'enrollment': 10019, 'city': 'Philadelphia', 'sortName': 'universityofpennsylvania', 'zip': '19104', 'rankingNoteCharacter': None, 'acceptance-rate': 9, 'rankingDisplayScore': '93', 'percent-receiving-aid': 47, 'cost-after-aid': 25441.0, 'state': 'PA', 'rankingSortRank': 8, 'hs-gpa-avg': 3.9, 'urlName': 'university-of-pennsylvania', 'rankingDisplayName': 'National Universities', 'rankingDisplayRank': '#8', 'ranking': 'overall', 'xwalkId': '215062', 'rankingIsTied': False, 'isPublic': True, 'businessRepScore': 4.8, 'tuition': 53534, 'engineeringRepScore': 3.6, 'displayName': 'University of Pennsylvania', 'schoolType': 'national-universities', 'region': None, 'aliasNames': 'UPenn', 'rankingType': 'national-universities', 'overallRank': 8, 'institutionalControl': 'private', 'rankingMaxPossibleScore': None, 'rankingRankStatus': 'ranked', 'primaryKey': '3378'}
{'_id': ObjectId('5eb9eb5eaac3d1b1d9457674'), 'rankingNoteText': None, 'nonResponderText': None, 'nonResponder': False, 'act-avg': 25, 'primaryPhoto': 'https://www.usnews.com/img/college-photo_12760.jpg', 'primaryPhotoThumb': 'https://www.usnews.com/img/college-photo_12760_170x170mm.jpg', 'sat-avg': 1080, 'enrollment': 13251, 'city': 'Philadelphia', 'sortName': 'drexeluniversity', 'zip': '19104', 'rankingNoteCharacter': None, 'acceptance-rate': 75, 'rankingDisplayScore': '51', 'percent-receiving-aid': 54, 'cost-after-aid': 42176.0, 'state': 'PA', 'rankingSortRank': 94, 'hs-gpa-avg': 3.7, 'urlName': 'drexel-university', 'rankingDisplayName': 'National Universities', 'rankingDisplayRank': '#94', 'ranking': 'overall', 'xwalkId': '212054', 'rankingIsTied': True, 'isPublic': True, 'businessRepScore': 2.8, 'tuition': 52002, 'engineeringRepScore': 3.0, 'displayName': 'Drexel University', 'schoolType': 'national-universities', 'region': None, 'aliasNames': None, 'rankingType': 'national-universities', 'overallRank': 94, 'institutionalControl': 'private', 'rankingMaxPossibleScore': None, 'rankingRankStatus': 'ranked', 'primaryKey': '3256'}
{'_id': ObjectId('5eb9eb5eaac3d1b1d945768b'), 'rankingNoteText': None, 'nonResponderText': None, 'nonResponder': False, 'act-avg': 24, 'primaryPhoto': 'https://www.usnews.com/img/college-photo_21951.jpg', 'primaryPhotoThumb': 'https://www.usnews.com/img/college-photo_21951_170x170mm.jpg', 'sat-avg': 1050, 'enrollment': 29275, 'city': 'Philadelphia', 'sortName': 'templeuniversity', 'zip': '19122', 'rankingNoteCharacter': None, 'acceptance-rate': 52, 'rankingDisplayScore': '47', 'percent-receiving-aid': 37, 'cost-after-aid': 35109.0, 'state': 'PA', 'rankingSortRank': 115, 'hs-gpa-avg': 3.6, 'urlName': 'temple-university', 'rankingDisplayName': 'National Universities', 'rankingDisplayRank': '#115', 'ranking': 'overall', 'xwalkId': '216339', 'rankingIsTied': True, 'isPublic': True, 'businessRepScore': 3.2, 'tuition': 27266, 'engineeringRepScore': 2.2, 'displayName': 'Temple University', 'schoolType': 'national-universities', 'region': None, 'aliasNames': None, 'rankingType': 'national-universities', 'overallRank': 115, 'institutionalControl': 'public', 'rankingMaxPossibleScore': None, 'rankingRankStatus': 'ranked', 'primaryKey': '3371'}

Find with query and return some fields

Suppose we are only interested in three fields displayName, ranking, and enrollment:

query = { "state":"PA", "city":"Philadelphia" }
pa_schools = schools.find(query, {"_id":0, "displayName":1, "rankingDisplayRank":1, "enrollment":1})
for school in pa_schools: 
    print(school)
{'enrollment': 10019, 'rankingDisplayRank': '#8', 'displayName': 'University of Pennsylvania'}
{'enrollment': 13251, 'rankingDisplayRank': '#94', 'displayName': 'Drexel University'}
{'enrollment': 29275, 'rankingDisplayRank': '#115', 'displayName': 'Temple University'}

Find and sort

Let's sort the Philadelphia schools in the descending order of enrollment:

query = { "state":"PA", "city":"Philadelphia" }
fields = {"_id":0, "displayName":1, "rankingDisplayRank":1, "enrollment":1}
pa_schools = schools.find(query, fields).sort("enrollment",-1)
for school in pa_schools: 
    print(school)
{'enrollment': 29275, 'rankingDisplayRank': '#115', 'displayName': 'Temple University'}
{'enrollment': 13251, 'rankingDisplayRank': '#94', 'displayName': 'Drexel University'}
{'enrollment': 10019, 'rankingDisplayRank': '#8', 'displayName': 'University of Pennsylvania'}

Group and Aggregation

We can perform aggregates such as counts in groups.

pipeline = [
    {"$group":
        {
            "_id": "$state", 
            "count":{"$sum": 1}
        }
    }
]
states = schools.aggregate(pipeline)
for state in states: 
    print(state)
{'_id': 'NV', 'count': 2}
{'_id': 'TX', 'count': 24}
{'_id': 'OH', 'count': 14}
{'_id': 'NC', 'count': 9}
{'_id': 'DE', 'count': 2}
{'_id': 'IL', 'count': 13}
{'_id': 'OK', 'count': 3}
{'_id': 'SC', 'count': 2}
{'_id': 'AL', 'count': 5}
{'_id': 'WV', 'count': 1}
{'_id': 'IN', 'count': 6}
{'_id': 'GA', 'count': 11}
{'_id': 'LA', 'count': 6}
{'_id': 'NY', 'count': 21}
{'_id': 'MI', 'count': 9}
{'_id': 'UT', 'count': 3}
{'_id': 'VT', 'count': 1}
{'_id': 'DC', 'count': 5}
{'_id': 'VA', 'count': 9}
{'_id': 'CA', 'count': 27}
{'_id': 'HI', 'count': 1}
{'_id': 'AR', 'count': 2}
{'_id': 'ME', 'count': 1}
{'_id': 'WY', 'count': 1}
{'_id': 'RI', 'count': 2}
{'_id': 'PA', 'count': 13}
{'_id': 'KS', 'count': 3}
{'_id': 'CT', 'count': 3}
{'_id': 'ND', 'count': 2}
{'_id': 'TN', 'count': 10}
{'_id': 'IA', 'count': 2}
{'_id': 'NH', 'count': 2}
{'_id': 'CO', 'count': 6}
{'_id': 'AZ', 'count': 6}
{'_id': 'MN', 'count': 4}
{'_id': 'NJ', 'count': 8}
{'_id': 'WA', 'count': 3}
{'_id': 'KY', 'count': 4}
{'_id': 'NE', 'count': 2}
{'_id': 'ID', 'count': 3}
{'_id': 'MD', 'count': 5}
{'_id': 'MT', 'count': 2}
{'_id': 'MO', 'count': 8}
{'_id': 'AK', 'count': 1}
{'_id': 'FL', 'count': 12}
{'_id': 'WI', 'count': 5}
{'_id': 'OR', 'count': 3}
{'_id': 'MS', 'count': 4}
{'_id': 'MA', 'count': 16}
{'_id': 'NM', 'count': 2}
{'_id': 'SD', 'count': 2}

Here the groups are based on (identified by) the state field. To count the number of documents for each group is to add $1$ to the sum for each document, hence "{$sum":1}.

We can add sort and limit to the pipeline to pick the top states:

pipeline = [
    {"$group":
        {
            "_id": "$state", 
            "count":{"$sum": 1}, 
            "sum_enroll":{"$sum": "$enrollment"},
            "avg_tuition":{"$avg":"$tuition"}
        }
    }, 
    {"$sort": {"count":-1} }, 
    {"$limit": 10}
]
states = schools.aggregate(pipeline)
for state in states: 
    print(state)
{'_id': 'CA', 'count': 27, 'sum_enroll': 382462, 'avg_tuition': 36503.81481481482}
{'_id': 'TX', 'count': 24, 'sum_enroll': 455150, 'avg_tuition': 27663.708333333332}
{'_id': 'NY', 'count': 21, 'sum_enroll': 215554, 'avg_tuition': 40859.42857142857}
{'_id': 'MA', 'count': 16, 'sum_enroll': 133298, 'avg_tuition': 42403.1875}
{'_id': 'OH', 'count': 14, 'sum_enroll': 227614, 'avg_tuition': 24363.928571428572}
{'_id': 'PA', 'count': 13, 'sum_enroll': 157966, 'avg_tuition': 38954.153846153844}
{'_id': 'IL', 'count': 13, 'sum_enroll': 144816, 'avg_tuition': 33293.53846153846}
{'_id': 'FL', 'count': 12, 'sum_enroll': 266100, 'avg_tuition': 26365.833333333332}
{'_id': 'GA', 'count': 11, 'sum_enroll': 158794, 'avg_tuition': 27488.18181818182}
{'_id': 'TN', 'count': 10, 'sum_enroll': 100680, 'avg_tuition': 28825.8}

Here we sort the results in the descending order of count and limit the list to top 10.

More on Pipeline

Pipeline, a series of operators to filter or transform data:

  • $match filters documents
  • $group groups documents by an id and can use all the computational expressions
    • for example: $max, $min
  • $unwind operates on an array field, yield documents for each array value
    • also complements $match and $group
  • $sort sorts documents by one or more fields
  • $skip skips over documents in the pipeline
  • $limit restricts the number of documents in an aggregation pipeline

Expressions produce documents based on calculations performed on input documents:

  • $max returns the highest value in the group
  • $min returns the lowest value in the group
  • $avg returns the average of all the group values
  • $sum returns the sum of all values in the group
  • $addToSet returns an array of all the distinct values for a certain field in each document in that group

Operators in expressions:

  • Boolean: $and, $or, and $not
  • Arithmetic: $add, $divide, $mod, $multiply, and $substract
  • String: $concat, $substr, $toUpper, $toLower, and $strcasecmp
  • Conditional: $cond and $ifNull

References