TDM 20100: Project 8 — 2023
Motivation: Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. Learning SQL is well worth your time!
Context: There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we’ve spent much of this semester in the terminal, we will start in the terminal using SQLite.
Scope: SQL, SQlite
Dataset(s)
For this project, we will be using the lahman
sqlite database. This database contains the data in the directory
-
/anvil/projects/tdm/data/lahman
You may get some more lahman
database information from this youtube video youtube.com/watch?v=tS_-oTbsDzs
To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database.
%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db
For every following cell where you want to run a SQL query, prepend %%sql
to the top of the cell — just like we do for R or bash cells.
Questions
Question 1 (2 pts)
Get started by taking a look at the available tables in the Lahman database.
-
What tables are available in the Lahman database?
You’ll want to prepend
In SQLite, you can show the tables using the following query:
Unfortunately, SQLite-specific functions can’t be run in a Jupyter Lab cell like that. Instead, we need to use a different query.
|
Question 2 (2 pts)
-
It’s always a good idea to learn what your table(s) looks like. A good way to do this is to get the first 5 rows of data from the table(s). Write and run queries that return the first 5 rows of data for the
people
table, thebatting
table, thefielding
table, themanagers
table, and 2 more tables of your choice (you can pick any 2 more tables to consider). -
To get a better idea of the size of the data, you can use the
count
clause to get the number of rows in each table. Write and run 6 queries that return the number of rows in each of these 6 tables.
Run each query in a separate cell, and remember to limit the query to return only 5 rows each. You can use the |
Question 3 (1 pt)
Okay, let’s dig into the people
table a little bit. Run the following query.
SELECT * FROM people LIMIT 5;
As you can see, every row has a playerID
for each player. It is a unique identifier or key for the people
table. In Question 2, you checked several tables, so you might already notice that a few tables contain this playerID
such as in table batting
, fielding
, managers
etc. The playerID
relates data from those tables to the specific player.
-
Let us find information about a famous baseball player named
Mike Trout
from thepeople
table.
The |
Question 4 (1 pt)
Now you understand what the playerID
means inside the database.
SELECT * FROM batting where playerID ='troutmi01'
The query will output all fields of data for Mike Trout from table batting
-
First use Mike Trout’s
playerID
(from Question 3) to find the number of his home runs in each season. -
Now make a second query that only displays Mike Trout’s data for the year
2022
but includes the playerID, teamID, and number of home runs.
The |
Question 5 (2 pts)
Now pick a different baseball player (your choice!) and find that baseball player’s information in the database.
-
For this baseball player, please find the baseball player’s information from the
people
table -
Please use the
playerID
to get this player’s number of home runs in the year 2022. -
Please join the
people
table and thebatting
table, to display information from the fields ofnameLast
,nameFirst
,weight
,height
,birthYear
, and number of home runs in the year 2022, along with theteamID
, andyearID
.
You may refer to the following website for SQLite table join examples www.sqlitetutorial.net/sqlite-join/ Use |
Project 08 Assignment Checklist
-
Jupyter notebook with your code, comments and output for questions 1 to 5
-
firstname-lastname-project08.ipynb
-
-
Submit files through Gradescope
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project |