Search
SQLite with Python

SQLite Python

SQLite with Python

  • SQLite: File-based relational SQL database
  • Python: A popular language in data science, etc.
  • SQLite is well supported in Python

This tutorial will follow the same college.db example in the SQLite presentation.

Python Code for SQLite

Import SQLite

import sqlite3

Load SQLite Database

To open and connect to the database file college.db:

db = 'college.db'
conn = sqlite3.connect(db)
c = conn.cursor()

We can now create the two tables:

  1. Programs with ID, program name, and description;
  2. Students with ID, name, program info, and class year.

using SQL statements with the SQLite3 interactive.

programs table

Create the programs table structure:

c.execute("drop table if exists programs")
c.execute("""create table programs (
                program_id varchar(10) not null primary key, 
                program_name varchar(20), 
                program_desc varchar(500)
            )""")
conn.commit()

Now we can start adding students data:

insert_sql = "insert into programs (program_id, program_name, program_desc) values (?, ?, ?)"

c.execute(insert_sql, ('IS', 'Information Systems', 'Information systems with a business angle') )
c.execute(insert_sql, ('DS', 'Data Science', 'New data science program') ) 
c.execute(insert_sql, ('CS', 'Computer Science', 'Traditional computer science') ) 
c.execute(insert_sql, ('CST', 'Computing & Security Technology', 'CST with a new focus on security') )

conn.commit()

Now let's take a look at data in the table:

students table

Create the students table:

c.execute("drop table if exists students")
c.execute("""
    create table students (
        student_id integer primary key, 
        student_name varchar(100), 
        program_id varchar(10),
        class_year int
    )
""")
conn.commit()

For the students tables, we have preppared a csv file with students records, which can be downloaded from: students.csv.

Suppose data file students.csv is in the data sub-directory (folder):

csv
1   John Smith,        IS,  2020
2   Albert Einstein,   DS   2021
3   George Washington, CS,  2019
4   Donald Trump,      CST, 2020
5   Barack Obama,      IS,  2021
.   ...                ..   ....

Let read the CSV file and load data into the students table:

import csv

# remove all student data first
delete_sql = "delete from students"
c.execute(delete_sql)

# load CSV data
f=open("data/students.csv")
insert_sql = "insert into students values (?, ?, ?, ?)"
for row in csv.reader(f):
    c.execute(insert_sql, row)
    
conn.commit()

Now check out the data:

c.execute("select * from students limit 10")
rows = c.fetchall()
for row in rows: 
    print(row)
(1, 'John Smith', 'IS', 2020)
(2, 'Albert Einstein', 'DS', 2021)
(3, 'George Washington', 'CS', 2019)
(4, 'Donald Trump', 'CST', 2020)
(5, 'Barack Obama', 'IS', 2021)
(6, 'Mark Twain', 'DS', 2020)
(7, 'Joey Binzinger', 'IS', 2023)
(8, 'Kanye West', 'DS', 2040)
(9, 'Felix Monika', 'CST', 2012)
(10, 'Noble Pearl', 'IS', 2015)

Now let's count the number of records in the table:

c.execute("select count(*) from students")
rows = c.fetchall()
for row in rows: 
    print("The total number of students:", row[0])
The total number of students: 50

Group Aggregates

Perhaps, you only want to know statistics in each program:

query = """select program_id, 
       count(*) as student_count, 
       min(class_year) as first_class, 
       max(class_year) as last_class
from students 
group by program_id
order by student_count DESC
"""
c.execute(query)
rows = c.fetchall()
for row in rows: 
    print(row[0], row[1], row[2], row[3])
IS 14 2015 2023
CST 13 2012 2022
CS 11 2010 2030
DS 8 2020 2030
IT 4 2018 2020

Update Student Data

Now that you see from the statistics some data values are unrealistic such as a class year $>2030$. Perhaps we can reset all these values to $2030$ if they are greater than that:

update_sql = "update students set class_year = 2030 where class_year > 2030"
c.execute(update_sql)
conn.commit()

We can run the statistics again to see if data have been updated:

from pprint import pprint

query = """select program_id, 
       count(*) as student_count, 
       min(class_year) as first_class, 
       max(class_year) as last_class
from students 
group by program_id
order by student_count DESC
"""
c.execute(query)
rows = c.fetchall()
pprint(rows)
[('IS', 14, 2015, 2023),
 ('CST', 13, 2012, 2022),
 ('CS', 11, 2010, 2030),
 ('DS', 8, 2020, 2030),
 ('IT', 4, 2018, 2020)]

Join Tables

Suppose you want to retrieve student data with their corresponding program information (e.g. program name):

query = """
    select student_name, program_name, class_year
    from students s join programs p 
         on s.program_id = p.program_id
    limit 10
"""
c.execute(query)
rows = c.fetchall()
for row in rows: 
    print("{:s} {:s} {:d}".format(row[0].ljust(20), row[1].ljust(40), row[2]))
John Smith           Information Systems                      2020
Albert Einstein      Data Science                             2021
George Washington    Computer Science                         2019
Donald Trump         Computing & Security Technology          2020
Barack Obama         Information Systems                      2021
Mark Twain           Data Science                             2020
Joey Binzinger       Information Systems                      2023
Kanye West           Data Science                             2030
Felix Monika         Computing & Security Technology          2012
Noble Pearl          Information Systems                      2015

References