In this tutorial, we are going to use a data set about university statistics: https://www.kaggle.com/theriley106/university-statistics.
A copy of the data can be downloaded from here:
The data are in the JSON format. An example is shown here:
{
"rankingNoteText":NULL
"nonResponderText":NULL
"nonResponder":boolfalse
"act-avg":int32
"primaryPhoto":string"https://www.usnews.com/img/college-photo_31291.jpg"
"primaryPhotoThumb":string"https://www.usnews.com/img/college-photo_31291_170x170mm.jpg"
"sat-avg":int1400
"enrollment":int5400
"city":string"Princeton"
"sortName":string"princetonuniversity"
"zip":string"08544"
"rankingNoteCharacter":NULL
"acceptance-rate":int7
"rankingDisplayScore":string"100"
"percent-receiving-aid":int60
"cost-after-aid":int16793
"state":string"NJ"
"rankingSortRank":int1
"hs-gpa-avg":float3.9
"urlName":string"princeton-university"
"rankingDisplayName":string"National Universities"
"rankingDisplayRank":string"#1"
"ranking":string"overall"
"xwalkId":string"186131"
"rankingIsTied":boolfalse
"isPublic":booltrue
"businessRepScore":NULL
"tuition":int47140
"engineeringRepScore":float4.1
"displayName":string"Princeton University"
"schoolType":string"national-universities"
"region":NULL
"aliasNames":NULL
"rankingType":string"national-universities"
"overallRank":int1
"institutionalControl":string"private"
"rankingMaxPossibleScore":NULL
"rankingRankStatus":string"ranked"
"primaryKey":string"2627"
}
import json
from pprint import pprint
from pymongo import MongoClient
Connect to MongoDB with PyMongo:
- Connect to MongoDB
- Select the
education
database - Select the
schools
collection
con = MongoClient()
db = con.education
schools = db.schools
Note when MongoClient()
is called, it connect to the MongoDB server locally on the default port. You can add parameters to connect to a different server such as ((not a real server address here):
con = pymongo.MongoClient("mongodb://example.com:27017/")
db
references the education
database and schools
is the data collection in the database.
response = schools.delete_many( {} )
count = response.deleted_count
print("{0} documents have been deleted.".format(count))
with open("data/schoolInfo.json") as f:
count = 0
data = json.loads(f.read())
for school in data:
# print(school)
schools.insert_one(school)
count += 1
print("{0} documents have been inserted.".format(count))
u = schools.find_one()
pprint(u)
docs = schools.find().limit(25) # limit() to avoid too much output
for doc in docs:
print(doc["displayName"], end=", ")
query = { "state":"PA", "city":"Philadelphia" }
pa_schools = schools.find(query)
for school in pa_schools:
print(school)
query = { "state":"PA", "city":"Philadelphia" }
pa_schools = schools.find(query, {"_id":0, "displayName":1, "rankingDisplayRank":1, "enrollment":1})
for school in pa_schools:
print(school)
query = { "state":"PA", "city":"Philadelphia" }
fields = {"_id":0, "displayName":1, "rankingDisplayRank":1, "enrollment":1}
pa_schools = schools.find(query, fields).sort("enrollment",-1)
for school in pa_schools:
print(school)
pipeline = [
{"$group":
{
"_id": "$state",
"count":{"$sum": 1}
}
}
]
states = schools.aggregate(pipeline)
for state in states:
print(state)
Here the groups are based on (identified by) the state
field. To count the number of documents for each group is to add $1$ to the sum for each document, hence "{$sum":1}
.
We can add sort
and limit
to the pipeline to pick the top states:
pipeline = [
{"$group":
{
"_id": "$state",
"count":{"$sum": 1},
"sum_enroll":{"$sum": "$enrollment"},
"avg_tuition":{"$avg":"$tuition"}
}
},
{"$sort": {"count":-1} },
{"$limit": 10}
]
states = schools.aggregate(pipeline)
for state in states:
print(state)
Here we sort the results in the descending order of count
and limit the list to top 10.
More on Pipeline
Pipeline, a series of operators to filter or transform data:
$match
filters documents$group
groups documents by an id and can use all the computational expressions- for example:
$max
,$min
- for example:
$unwind
operates on an array field, yield documents for each array value- also complements
$match
and$group
- also complements
$sort
sorts documents by one or more fields$skip
skips over documents in the pipeline$limit
restricts the number of documents in an aggregation pipeline
Expressions produce documents based on calculations performed on input documents:
$max
returns the highest value in the group$min
returns the lowest value in the group$avg
returns the average of all the group values$sum
returns the sum of all values in the group$addToSet
returns an array of all the distinct values for a certain field in each document in that group
Operators in expressions:
- Boolean:
$and
,$or
, and$not
- Arithmetic:
$add
,$divide
,$mod
,$multiply
, and$substract
- String:
$concat
,$substr
,$toUpper
,$toLower
, and$strcasecmp
- Conditional:
$cond
and$ifNull
References
- Chapter 12 Data Processing and Aggregation with MongoDB, of Hector Cuesta (2013). Practical Data Analysis. https://ebookcentral-proquest-com.ezproxy2.library.drexel.edu/lib/drexel-ebooks/detail.action?docID=1507840