In this tutorial, we are going to use a data set about university statistics: https://www.kaggle.com/theriley106/university-statistics.
A copy of the data can be downloaded from here:
The data are in the JSON format. An example is shown here:
{
"rankingNoteText":NULL
"nonResponderText":NULL
"nonResponder":boolfalse
"act-avg":int32
"primaryPhoto":string"https://www.usnews.com/img/college-photo_31291.jpg"
"primaryPhotoThumb":string"https://www.usnews.com/img/college-photo_31291_170x170mm.jpg"
"sat-avg":int1400
"enrollment":int5400
"city":string"Princeton"
"sortName":string"princetonuniversity"
"zip":string"08544"
"rankingNoteCharacter":NULL
"acceptance-rate":int7
"rankingDisplayScore":string"100"
"percent-receiving-aid":int60
"cost-after-aid":int16793
"state":string"NJ"
"rankingSortRank":int1
"hs-gpa-avg":float3.9
"urlName":string"princeton-university"
"rankingDisplayName":string"National Universities"
"rankingDisplayRank":string"#1"
"ranking":string"overall"
"xwalkId":string"186131"
"rankingIsTied":boolfalse
"isPublic":booltrue
"businessRepScore":NULL
"tuition":int47140
"engineeringRepScore":float4.1
"displayName":string"Princeton University"
"schoolType":string"national-universities"
"region":NULL
"aliasNames":NULL
"rankingType":string"national-universities"
"overallRank":int1
"institutionalControl":string"private"
"rankingMaxPossibleScore":NULL
"rankingRankStatus":string"ranked"
"primaryKey":string"2627"
}
import json
from pprint import pprint
from pymongo import MongoClient
Connect to MongoDB with PyMongo:
- Connect to MongoDB
- Select the
educationdatabase - Select the
schoolscollection
con = MongoClient()
db = con.education
schools = db.schools
Note when MongoClient() is called, it connect to the MongoDB server locally on the default port. You can add parameters to connect to a different server such as ((not a real server address here):
con = pymongo.MongoClient("mongodb://example.com:27017/")
db references the education database and schools is the data collection in the database.
response = schools.delete_many( {} )
count = response.deleted_count
print("{0} documents have been deleted.".format(count))
with open("data/schoolInfo.json") as f:
count = 0
data = json.loads(f.read())
for school in data:
# print(school)
schools.insert_one(school)
count += 1
print("{0} documents have been inserted.".format(count))
u = schools.find_one()
pprint(u)
docs = schools.find().limit(25) # limit() to avoid too much output
for doc in docs:
print(doc["displayName"], end=", ")
query = { "state":"PA", "city":"Philadelphia" }
pa_schools = schools.find(query)
for school in pa_schools:
print(school)
query = { "state":"PA", "city":"Philadelphia" }
pa_schools = schools.find(query, {"_id":0, "displayName":1, "rankingDisplayRank":1, "enrollment":1})
for school in pa_schools:
print(school)
query = { "state":"PA", "city":"Philadelphia" }
fields = {"_id":0, "displayName":1, "rankingDisplayRank":1, "enrollment":1}
pa_schools = schools.find(query, fields).sort("enrollment",-1)
for school in pa_schools:
print(school)
pipeline = [
{"$group":
{
"_id": "$state",
"count":{"$sum": 1}
}
}
]
states = schools.aggregate(pipeline)
for state in states:
print(state)
Here the groups are based on (identified by) the state field. To count the number of documents for each group is to add $1$ to the sum for each document, hence "{$sum":1}.
We can add sort and limit to the pipeline to pick the top states:
pipeline = [
{"$group":
{
"_id": "$state",
"count":{"$sum": 1},
"sum_enroll":{"$sum": "$enrollment"},
"avg_tuition":{"$avg":"$tuition"}
}
},
{"$sort": {"count":-1} },
{"$limit": 10}
]
states = schools.aggregate(pipeline)
for state in states:
print(state)
Here we sort the results in the descending order of count and limit the list to top 10.
More on Pipeline
Pipeline, a series of operators to filter or transform data:
$matchfilters documents$groupgroups documents by an id and can use all the computational expressions- for example:
$max,$min
- for example:
$unwindoperates on an array field, yield documents for each array value- also complements
$matchand$group
- also complements
$sortsorts documents by one or more fields$skipskips over documents in the pipeline$limitrestricts the number of documents in an aggregation pipeline
Expressions produce documents based on calculations performed on input documents:
$maxreturns the highest value in the group$minreturns the lowest value in the group$avgreturns the average of all the group values$sumreturns the sum of all values in the group$addToSetreturns an array of all the distinct values for a certain field in each document in that group
Operators in expressions:
- Boolean:
$and,$or, and$not - Arithmetic:
$add,$divide,$mod,$multiply, and$substract - String:
$concat,$substr,$toUpper,$toLower, and$strcasecmp - Conditional:
$condand$ifNull
References
- Chapter 12 Data Processing and Aggregation with MongoDB, of Hector Cuesta (2013). Practical Data Analysis. https://ebookcentral-proquest-com.ezproxy2.library.drexel.edu/lib/drexel-ebooks/detail.action?docID=1507840