In this lab we will practiced data wrangling and visualization skills using real-time COVID-19 data maintained by the New York Times. The data has been used to create reports and data visualizations like this, and are archived on a GitHub repo here. Looking at the README in this repository we read:

“We are providing two sets of data with cumulative counts of coronavirus cases and deaths: one with our most current numbers for each geography and another with historical data showing the tally for each day for each geography … the historical files are the final counts at the end of each day … The historical and live data are released in three files, one for each of these geographic levels: U.S., states and counties.”

For this lab we will use the historic, county level data which is stored as an updating CSV at this URL:

https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv

Part 1

Here we are interested in the criteria used by the California Department of Public Health to maintain a watch list of counties that are being monitored for worsening coronavirus trends. There are six criteria used to place counties on the watch list: 1. Doing fewer than 150 tests per 100,000 residents daily (over a 7-day average)

  1. More than 100 new cases per 100,000 residents over the past 14 days…

  2. 25 new cases per 100,000 residents and an 8% test positivity rate

  3. 10% or greater increase in COVID-19 hospitalized patients over the past 3 days

  4. Fewer than 20% of ICU beds available

  5. Fewer than 25% ventilators available

Of these 6 conditions, we will be looking at the second. To do this, I set up a reproducible framework to communicate the following in a way that can be updated every time new data is released: 1. Cumulative cases in the 5 worst counties 2. Total NEW cases in the 5 worst counties 3. A list of safe counties 4. A text report describing the total new cases, total cumulative cases and number of safe counties.

I also built this analysis in such a way that running it will extract the most current data straight from the NY-Times URL, the state name is a parameter that can be changed allowing this report to be run for other states.

1.1 Reading in the data

Libraries
library(tidyverse)
library(readxl)
library(zoo)
library(knitr)
library(ggplot2)
covid = read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")

state.of.interest = "California"

covid %>%
  filter(state == state.of.interest) %>%
  group_by(county) %>%
  mutate(newCases = cases - lag(cases)) %>%
  ungroup() ->
  newCases2

1.2 Five counties with the most cumulative cases in California

newCases2 %>%
  filter(date == max(date)) %>%
  arrange(-cases) %>%
  head(5) %>%
  select(county, cases) %>%
  knitr::kable(caption = paste("Most Cumulative Cases", state.of.interest),
               col.names = c("County", "Cumulative Cases"),
               format.args = list(big.mark = ",")) %>%
  kableExtra::kable_styling("striped", full_width = F, font_size = 14)
Most Cumulative Cases California
County Cumulative Cases
Los Angeles 253,985
Riverside 55,073
Orange 52,121
San Bernardino 50,699
San Diego 42,742

1.3 Five counties with the most new cases in California

newCases2 %>%
  filter(date == max(date)) %>%
  arrange(-newCases) %>%
  head(5) %>%
  select(county, newCases) %>%
  knitr::kable(caption = paste("Most New Cases", state.of.interest),
               col.names = c("County", "New Cases"),
               format.args = list(big.mark = ",")) %>%
  kableExtra::kable_styling("bordered", full_width = F, font_size = 14)
Most New Cases California
County New Cases
Los Angeles 809
San Diego 265
Orange 185
Fresno 159
San Bernardino 156

1.4 Exploring the data

pop <- read_excel("~/github/geog-176A-labs/data/PopulationEstimates.xls", skip = 2) %>%
  select (fips = "FIPStxt", state = "State", Area_Name, pop2019 = "POP_ESTIMATE_2019")

names(pop)
## [1] "fips"      "state"     "Area_Name" "pop2019"
nrow(pop)
## [1] 3273
str(pop)
## tibble [3,273 × 4] (S3: tbl_df/tbl/data.frame)
##  $ fips     : chr [1:3273] "00000" "01000" "01001" "01003" ...
##  $ state    : chr [1:3273] "US" "AL" "AL" "AL" ...
##  $ Area_Name: chr [1:3273] "United States" "Alabama" "Autauga County" "Baldwin County" ...
##  $ pop2019  : num [1:3273] 3.28e+08 4.90e+06 5.59e+04 2.23e+05 2.47e+04 ...
dim(pop)
## [1] 3273    4

1.5 Join population data to NYT covid data

covid_pop = left_join(newCases2, pop, by = "fips")

1.6 Five counties with the most cumulative cases in California per capita

covid_pop %>%
  filter(date == max(date)) %>%
  mutate(cases_pcap = cases/pop2019) %>%
  arrange(-cases_pcap) %>%
  head(5) %>%
  select(county, cases_pcap) %>%
  knitr::kable(caption = paste("Most Cumulative Cases Per Capita", state.of.interest),
               col.names = c("County", "Cumulative Cases Per Capita"),
               format.args = list(big.mark = ",")) %>%
  kableExtra::kable_styling("bordered", full_width = F, font_size = 14)
Most Cumulative Cases Per Capita California
County Cumulative Cases Per Capita
Imperial 0.0622134
Kings 0.0464038
Kern 0.0341423
Tulare 0.0324199
Merced 0.0307584

1.7 Five counties with the most new cases in California per capita

covid_pop %>%
  filter(date == max(date)) %>%
  mutate(newcases_pcap = newCases/pop2019) %>%
  arrange(-newcases_pcap) %>%
  head(5) %>%
  select(county, newcases_pcap) %>%
  knitr::kable(caption = paste("Most New Cases Per Capita", state.of.interest),
               col.names = c("County", "New Cases Per Capita"),
               format.args = list(big.mark = ",")) %>%
  kableExtra::kable_styling("bordered", full_width = F, font_size = 14)
Most New Cases Per Capita California
County New Cases Per Capita
Kings 0.0002615
San Benito 0.0002388
Monterey 0.0002027
Lake 0.0001708
Fresno 0.0001591

1.8 Filter data to last 14 days

covid_pop %>%
  filter(date >= max(date)-13) %>%
  group_by(county, pop2019) %>%
  summarize(newCases_p100 =  sum(newCases)) %>%
  mutate(newCases_p100 = newCases_p100 / (pop2019/100000)) %>%
  filter(newCases_p100 < 100) ->
  final

newCases2 %>%
  summarise(tot_cases = sum(cases)) -> x

newCases2 %>%
  summarise(tot_newcases = sum(newCases, na.rm = T)) -> y

1.9 Summary info

paste("The total number of cases is", x,
      "the total number of new cases is", y,
      "and the total number of safe counties is", nrow(final))
## [1] "The total number of cases is 48524512 the total number of new cases is 763315 and the total number of safe counties is 21"

Question 2

2.1 Graph state level data for four states and calculate the 7-day rolling mean

covid = read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")

#part 2
covid %>%
  filter(state == "California" | state == "New York" | state == "Florida" | state == "Louisiana") %>%
  group_by(state, date) %>%
  summarise(cases = sum(cases)) %>%
  mutate(newCases = cases - lag(cases),
         roll7 = rollmean(newCases, 7, fill = NA, align="right")) %>%
  ungroup() %>%
  ggplot(aes(x = date))+
  geom_col(aes(y = newCases, color = state)) +
  geom_line(aes(y = roll7), size = 1) +
  facet_wrap(~state)+
  theme_dark() +
  labs(title = paste("New Reported Cases by Day in States of Interest"),
       color = "",
       x = "Date",
       y = "") +
  theme(plot.background = element_rect(fill = "white"),
        panel.background = element_rect(fill = "white"),
        plot.title = element_text(size = 14, face = 'bold'))

This data is not per capita, and is therefore misleading. Now I will calculate daily new cases per capita with the 7-day rolling mean.


2.2 State level data per capita

pop <- read_excel("~/github/geog-176A-labs/data/PopulationEstimates.xls", skip = 2) %>%
  select (fips = "FIPStxt", state = "Area_Name", pop2019 = "POP_ESTIMATE_2019")

covid_pop = left_join(covid, pop, by = "state")

covid_pop %>%
  filter(state == "California" | state == "New York" | state == "Florida" | state == "Louisiana") %>%
  select(date, state, cases, pop2019) %>%
  group_by(state, date, pop2019) %>%
  summarise(cases = sum(cases)) %>%
  ungroup() %>%
  group_by(state) %>%
  mutate(newCases = cases - lag(cases),
         newCases_pcap = newCases/pop2019,
         roll7 = rollmean(newCases_pcap, 7, fill = NA, align="right")) %>%
  ggplot(aes(x = date))+
  geom_col(aes(y = newCases_pcap, color = state)) +
  geom_line(aes(y = roll7), size = 1) +
  facet_wrap(~state)+
  theme_dark() +
  labs(title = paste("New Reported Cases Per Capita by Day in States of Interest"),
       color = "",
       x = "Date",
       y = "") +
  theme(plot.title = element_text(size = 10, face = 'bold'))

By looking at the number of cases per capita, some states appear to do better than others compared to the raw data. The peak in the CA plot appears much smaller, while the peak in the LA plot appears much larger because CA’s population is much larger. So while CA has more cases than LA, they have a smaller ratio to the total population.