STAT 39000: Project 8 — Fall 2020
Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. awk
is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
Context: This is the last part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk
. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
Scope: awk, UNIX utilities, bash scripts
Dataset:
The following questions will use the dataset found in Scholar:
/class/datamine/data/flights/subset/YYYY.csv
An example of the data for the year 1987 can be found here.
Questions
Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. |
Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like |
Question 1
Let’s say we have a theory that there are more flights on the weekend days (Friday, Saturday, Sunday) than the rest of the days, on average. We can use awk to quickly check it out and see if maybe this looks like something that is true!
Write a line of awk
code that, prints the total number of flights that occur on weekend days, followed by the total number of flights that occur on the weekdays. Complete this calculation for 2008 using the 2008.csv
file.
Modify your code to instead print the average number of flights that occur on weekend days, followed by the average number of flights that occur on the weekdays.
You don’t need a large if statement to do this, you can use the |
-
Lines of
awk
code that solves the problem. -
The result: the number of flights on the weekend days, followed by the number of flights on the weekdays for the flights during 2008.
-
The result: the average number of flights on the weekend days, followed by the average number of flights on the weekdays for the flights during 2008.
Question 2
We want to look to see if there may be some truth to the whole "snow bird" concept where people will travel to warmer states like Florida and Arizona during the Winter. Let’s use the tools we’ve learned to explore this a little bit.
Take a look at airports.csv
. In particular run the following:
head airports.csv
Notice how all of the non-numeric text is surrounded by quotes. The surrounding quotes would need to be escaped for any comparison within awk
. This is messy and we would prefer to create a new file called new_airports.csv
without any quotes. Write a line of code to do this.
You may be wondering why we are asking you to do this. This sort of situation (where you need to deal with quotes) happens a lot! It’s important to practice and learn ways to fix these things. |
You could use |
If you leave out the column number argument to |
|
-
Line of
awk
code used to create the new dataset.
Question 3
Write a line of commands that creates a new dataset called az_fl_airports.txt
. az_fl_airports.txt
should only contain a list of airport codes for all airports from both Arizona (AZ) and Florida (FL). Use the file we created in (3),new_airports.csv
as a starting point.
How many airports are there? Did you expect this? Use a line of bash code to count this.
Create a new dataset (called az_fl_flights.txt
) that contains all of the data for flights into or out of Florida and Arizona (using the 2008.csv
file). Use the newly created dataset, az_fl_airports.txt
to accomplish this.
|
-
All UNIX commands used to answer the questions.
-
The number of airports.
-
1-2 sentences explaining whether you expected this number of airports.
Question 4
Write a bash script that accepts the start year, end year, and filename containing airport codes (az_fl_airports.txt
), and outputs the data for flights into or out of any of the airports listed in the provided filename (az_fl_airports.txt
). The script should output data for flights using all of the years of data in the provided range. Run the bash script to create a new file called az_fl_flights_total.csv
.
-
The content of your bash script (starting with "#!/bin/bash") in a code chunk.
-
The line of UNIX code you used to execute the script and create the new dataset.
Question 5
Use the newly created dataset, az_fl_flights_total.csv
, from question 4 to calculate the total number of flights into and out of both states by month, and by year, for a total of 3 columns (year, month, flights). Export this information to a new file called snowbirds.csv
.
Load up your newly created dataset and use either R or Python (or some other tool) to create a graphic that illustrates whether or not we believe the "snowbird effect" effects flights. Include a description of your graph, as well as your (anecdotal) conclusion.
You can use 1 dimensional arrays to accomplish this if the key is the combination of, for example, the year and month. |
-
The line of
awk
code used to create the new dataset,snowbirds.csv
. -
Code used to create the visualization in a code chunk.
-
The generated plot as either a png or jpg/jpeg.
-
1-2 sentences describing your plot and your conclusion.