TDM 40100: Project 4 — 2022
Motivation: The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a sqlite3
database, design a schema, insert data, create indexes, etc. This series of projects is focused around SQL, sqlite3
, with the opportunity to use other skills you’ve built throughout the previous years.
Context: In TDM 20100 (formerly STAT 29000), you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with sqlite3
— a powerful database engine. In this project (and following projects), we will branch into SQL and sqlite3
-specific topics and techniques that you haven’t yet had exposure to in The Data Mine.
Scope: sqlite3
, lmod, SQL
Dataset(s)
The following questions will use the following dataset(s):
-
/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json
-
/anvil/projects/tdm/data/goodreads/goodreads_book_series.json
-
/anvil/projects/tdm/data/goodreads/goodreads_books.json
-
/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json
Questions
This project is going to be a bit more open. The goal of this project is to take our dataset sample, and write Python code to insert it into our sqlite3
database. There are a variety of ways this could be accomplished, and we will accept anything that works, with a few constraints.
In the next project, we will run some experiments that will time insertion, project the time it would take to insert all of the data, adjust database settings, and, ultimately, create a final product that we can feel good about.
Question 1
As mentioned earlier — the goal of this project is to insert the sample data into our database. Start by generating the sample data.
%%bash
rm -rf $HOME/goodreads_samples
mkdir $HOME/goodreads_samples
cp /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json $HOME/goodreads_samples/
cp /anvil/projects/tdm/data/goodreads/goodreads_book_series.json $HOME/goodreads_samples/
shuf -n 27450 /anvil/projects/tdm/data/goodreads/goodreads_books.json > $HOME/goodreads_samples/goodreads_books.json
shuf -n 98375 /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json > $HOME/goodreads_samples/goodreads_reviews_dedup.json
In addition, go ahead and copy our empty database that is ready for you to insert data into.
%%bash
rm $HOME/my.db
cp /anvil/projects/tdm/data/goodreads/my.db $HOME
You can run this as many times as you need in order to get a fresh start.
-
Code used to solve this problem.
-
Output from running the code.
Question 2
Write Python code that inserts the data into your database. Here are the constraints.
-
You should be able to fully recover the
book_cover
image from the database. This means you’ll need to handle scraping theimage_url
and converting the image tobytes
before inserting into the database.Want some help to write the scraping code? Check out this 30100 question for more guidance.
-
Your functions and code should ultimately operate on a single row of the datasets. For instance:
import json with open("/anvil/projects/tdm/data/goodreads/goodreads_books.json") as f: for line in f: print(line) parsed = json.loads(line) print(f"{parsed['isbn']=}") print(f"{parsed['num_pages']=}") break
Here, you can see that we can take a single row and do something to it. Why do we want it to work this way? This makes it easy to break our dataset into chunks and perform operations in parallel, if we so choose (and we will, but not in this project).
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Demonstrate your database works by doing the following.
-
Fully recover a
book_cover
and display it in your notebook.%%bash rm $HOME/test.db || true sqlite3 $HOME/test.db "CREATE TABLE test ( id INTEGER PRIMARY KEY AUTOINCREMENT, my_blob BLOB );"
import shutil import requests import os import uuid import sqlite3 url = 'https://images.gr-assets.com/books/1310220028m/5333265.jpg' my_bytes = scrape_image_from_url(url) # insert conn = sqlite3.connect('/home/x-kamstut/test.db') cursor = conn.cursor() query = f"INSERT INTO test (my_blob) VALUES (?);" dat = (my_bytes,) cursor.execute(query, dat) conn.commit() cursor.close() # retrieve conn = sqlite3.connect('/home/x-kamstut/test.db') cursor = conn.cursor() query = f"SELECT * from test where id = ?;" cursor.execute(query, (1,)) record = cursor.fetchall() img = record[0][1] tmp_filename = str(uuid.uuid4()) with open(f"{tmp_filename}.jpg", 'wb') as file: file.write(img) from IPython import display display.Image(f"{tmp_filename}.jpg")
-
Run a simple query to
SELECT
the first 5 rows of each table.%sql sqlite:////home/my-username/my.db
%%sql SELECT * FROM tablename LIMIT 5;
Make sure to replace "my-username" with your Anvil username, for example, x-kamstut is mine.
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |