In this lab we will practiced data wrangling and visualization skills using real-time COVID-19 data maintained by the New York Times. The data has been used to create reports and data visualizations like this, and are archived on a GitHub repo here. Looking at the README in this repository we read:
“We are providing two sets of data with cumulative counts of coronavirus cases and deaths: one with our most current numbers for each geography and another with historical data showing the tally for each day for each geography … the historical files are the final counts at the end of each day … The historical and live data are released in three files, one for each of these geographic levels: U.S., states and counties.”
For this lab we will use the historic, county level data which is stored as an updating CSV at this URL:
https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
Here we are interested in the criteria used by the California Department of Public Health to maintain a watch list of counties that are being monitored for worsening coronavirus trends. There are six criteria used to place counties on the watch list: 1. Doing fewer than 150 tests per 100,000 residents daily (over a 7-day average)
More than 100 new cases per 100,000 residents over the past 14 days…
25 new cases per 100,000 residents and an 8% test positivity rate
10% or greater increase in COVID-19 hospitalized patients over the past 3 days
Fewer than 20% of ICU beds available
Fewer than 25% ventilators available
Of these 6 conditions, we will be looking at the second. To do this, I set up a reproducible framework to communicate the following in a way that can be updated every time new data is released: 1. Cumulative cases in the 5 worst counties 2. Total NEW cases in the 5 worst counties 3. A list of safe counties 4. A text report describing the total new cases, total cumulative cases and number of safe counties.
I also built this analysis in such a way that running it will extract the most current data straight from the NY-Times URL, the state name is a parameter that can be changed allowing this report to be run for other states.
library(tidyverse)
library(readxl)
library(zoo)
library(knitr)
library(ggplot2)covid = read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")
state.of.interest = "California"
covid %>%
  filter(state == state.of.interest) %>%
  group_by(county) %>%
  mutate(newCases = cases - lag(cases)) %>%
  ungroup() ->
  newCases2newCases2 %>%
  filter(date == max(date)) %>%
  arrange(-cases) %>%
  head(5) %>%
  select(county, cases) %>%
  knitr::kable(caption = paste("Most Cumulative Cases", state.of.interest),
               col.names = c("County", "Cumulative Cases"),
               format.args = list(big.mark = ",")) %>%
  kableExtra::kable_styling("striped", full_width = F, font_size = 14)| County | Cumulative Cases | 
|---|---|
| Los Angeles | 253,985 | 
| Riverside | 55,073 | 
| Orange | 52,121 | 
| San Bernardino | 50,699 | 
| San Diego | 42,742 | 
newCases2 %>%
  filter(date == max(date)) %>%
  arrange(-newCases) %>%
  head(5) %>%
  select(county, newCases) %>%
  knitr::kable(caption = paste("Most New Cases", state.of.interest),
               col.names = c("County", "New Cases"),
               format.args = list(big.mark = ",")) %>%
  kableExtra::kable_styling("bordered", full_width = F, font_size = 14)| County | New Cases | 
|---|---|
| Los Angeles | 809 | 
| San Diego | 265 | 
| Orange | 185 | 
| Fresno | 159 | 
| San Bernardino | 156 | 
pop <- read_excel("~/github/geog-176A-labs/data/PopulationEstimates.xls", skip = 2) %>%
  select (fips = "FIPStxt", state = "State", Area_Name, pop2019 = "POP_ESTIMATE_2019")
names(pop)## [1] "fips"      "state"     "Area_Name" "pop2019"nrow(pop)## [1] 3273str(pop)## tibble [3,273 × 4] (S3: tbl_df/tbl/data.frame)
##  $ fips     : chr [1:3273] "00000" "01000" "01001" "01003" ...
##  $ state    : chr [1:3273] "US" "AL" "AL" "AL" ...
##  $ Area_Name: chr [1:3273] "United States" "Alabama" "Autauga County" "Baldwin County" ...
##  $ pop2019  : num [1:3273] 3.28e+08 4.90e+06 5.59e+04 2.23e+05 2.47e+04 ...dim(pop)## [1] 3273    4covid_pop = left_join(newCases2, pop, by = "fips")covid_pop %>%
  filter(date == max(date)) %>%
  mutate(cases_pcap = cases/pop2019) %>%
  arrange(-cases_pcap) %>%
  head(5) %>%
  select(county, cases_pcap) %>%
  knitr::kable(caption = paste("Most Cumulative Cases Per Capita", state.of.interest),
               col.names = c("County", "Cumulative Cases Per Capita"),
               format.args = list(big.mark = ",")) %>%
  kableExtra::kable_styling("bordered", full_width = F, font_size = 14)| County | Cumulative Cases Per Capita | 
|---|---|
| Imperial | 0.0622134 | 
| Kings | 0.0464038 | 
| Kern | 0.0341423 | 
| Tulare | 0.0324199 | 
| Merced | 0.0307584 | 
covid_pop %>%
  filter(date == max(date)) %>%
  mutate(newcases_pcap = newCases/pop2019) %>%
  arrange(-newcases_pcap) %>%
  head(5) %>%
  select(county, newcases_pcap) %>%
  knitr::kable(caption = paste("Most New Cases Per Capita", state.of.interest),
               col.names = c("County", "New Cases Per Capita"),
               format.args = list(big.mark = ",")) %>%
  kableExtra::kable_styling("bordered", full_width = F, font_size = 14)| County | New Cases Per Capita | 
|---|---|
| Kings | 0.0002615 | 
| San Benito | 0.0002388 | 
| Monterey | 0.0002027 | 
| Lake | 0.0001708 | 
| Fresno | 0.0001591 | 
covid_pop %>%
  filter(date >= max(date)-13) %>%
  group_by(county, pop2019) %>%
  summarize(newCases_p100 =  sum(newCases)) %>%
  mutate(newCases_p100 = newCases_p100 / (pop2019/100000)) %>%
  filter(newCases_p100 < 100) ->
  final
newCases2 %>%
  summarise(tot_cases = sum(cases)) -> x
newCases2 %>%
  summarise(tot_newcases = sum(newCases, na.rm = T)) -> ypaste("The total number of cases is", x,
      "the total number of new cases is", y,
      "and the total number of safe counties is", nrow(final))## [1] "The total number of cases is 48524512 the total number of new cases is 763315 and the total number of safe counties is 21"covid = read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")
#part 2
covid %>%
  filter(state == "California" | state == "New York" | state == "Florida" | state == "Louisiana") %>%
  group_by(state, date) %>%
  summarise(cases = sum(cases)) %>%
  mutate(newCases = cases - lag(cases),
         roll7 = rollmean(newCases, 7, fill = NA, align="right")) %>%
  ungroup() %>%
  ggplot(aes(x = date))+
  geom_col(aes(y = newCases, color = state)) +
  geom_line(aes(y = roll7), size = 1) +
  facet_wrap(~state)+
  theme_dark() +
  labs(title = paste("New Reported Cases by Day in States of Interest"),
       color = "",
       x = "Date",
       y = "") +
  theme(plot.background = element_rect(fill = "white"),
        panel.background = element_rect(fill = "white"),
        plot.title = element_text(size = 14, face = 'bold'))This data is not per capita, and is therefore misleading. Now I will calculate daily new cases per capita with the 7-day rolling mean.
pop <- read_excel("~/github/geog-176A-labs/data/PopulationEstimates.xls", skip = 2) %>%
  select (fips = "FIPStxt", state = "Area_Name", pop2019 = "POP_ESTIMATE_2019")
covid_pop = left_join(covid, pop, by = "state")
covid_pop %>%
  filter(state == "California" | state == "New York" | state == "Florida" | state == "Louisiana") %>%
  select(date, state, cases, pop2019) %>%
  group_by(state, date, pop2019) %>%
  summarise(cases = sum(cases)) %>%
  ungroup() %>%
  group_by(state) %>%
  mutate(newCases = cases - lag(cases),
         newCases_pcap = newCases/pop2019,
         roll7 = rollmean(newCases_pcap, 7, fill = NA, align="right")) %>%
  ggplot(aes(x = date))+
  geom_col(aes(y = newCases_pcap, color = state)) +
  geom_line(aes(y = roll7), size = 1) +
  facet_wrap(~state)+
  theme_dark() +
  labs(title = paste("New Reported Cases Per Capita by Day in States of Interest"),
       color = "",
       x = "Date",
       y = "") +
  theme(plot.title = element_text(size = 10, face = 'bold'))By looking at the number of cases per capita, some states appear to do better than others compared to the raw data. The peak in the CA plot appears much smaller, while the peak in the LA plot appears much larger because CA’s population is much larger. So while CA has more cases than LA, they have a smaller ratio to the total population.