In this analysis we look at the NOAA Storm Database to answer 2 questions. Which type of weather events are most harmful to human health, and which cause the most economic damage. From the data we find that Tornadoes cause the most overall damage in both respects. Other harmful and damaging types of weather events include: flood, heat, wind, hail and lightning. To arrive at these results we plot total fatalities, injuries, property damage and crop damage for each weather event with significantly high observations.
Load in required packages.
library(ggplot2)
library(reshape2)
library(dplyr)
Download the dataset, if statement is used to prevent from unnecessarily downloading.
filename <- "storm_data.csv"
if(!file.exists(filename)){download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","storm_data.csv")}
We read in the data.
stormData <- read.csv("storm_data.csv",na.strings = "")
To answer this question we are interested in variables indicating number of fatalities and injuries. We check The number of missing values in each.
colSums(sapply(stormData[,c("FATALITIES","INJURIES")],is.na))
## FATALITIES INJURIES
## 0 0
We subset the data by any observations that have at least 1 fatality or at least 1 injury. This removes a lot of data that is not having an impact on human health.
harmful_stormData <- subset(stormData,stormData$FATALITIES!=0 | stormData$INJURIES!=0)
To make a plot our first plot we need to total all the fatalities by each weather event. From the summary we see that the data has a positive skew.
totalFatal <- tapply(harmful_stormData$FATALITIES,harmful_stormData$EVTYPE,sum)
summary(totalFatal)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 2.00 68.84 10.25 5633.00
To account for this we subset the list of total fatalities by including events which account for more than 0.1 % of all fatalities.To do this first calculate the each of the total fatalities as a percentage, this shows what percentage of total fatalities each weather event makes up for.
percentFatal <- 100*totalFatal/sum(totalFatal)
head(percentFatal,15)
## AVALANCE AVALANCHE BLACK ICE BLIZZARD blowing snow BLOWING SNOW
## 0.006602839 1.479035985 0.006602839 0.666886761 0.006602839 0.006602839
## BRUSH FIRE COASTAL FLOOD Coastal Flooding COASTAL FLOODING COASTAL FLOODING/EROSION Coastal Storm
## 0.000000000 0.019808518 0.013205678 0.006602839 0.000000000 0.000000000
## COASTAL STORM COASTALSTORM Cold
## 0.019808518 0.006602839 0.019808518
Just from the first 15 events we can see there are many weather events that account for a very small amount of the total fatalities, we will remove these as they will make the plot difficult to read and are not as relevant as weather events that make up for larger percentages of total fatalities. We choose a threshold of 0.1% as this leaves over 97% of all fatalities left in the data but removes a lot of weather events.
MajorityFatal_index<- percentFatal[which(percentFatal>0.1)]
sum(MajorityFatal_index)
## [1] 97.28623
totalFatal <- totalFatal[which(percentFatal>0.1)]
Now we have the weather events we are interested in we put them into a dataframe.
df_fatal <- data.frame(Event = names(totalFatal),Fatalities=totalFatal)
Before plotting we take a look at our data in descending order of total number of fatalities and see that Tornadoes are by far and away the biggest contributor. For purposes of readability we don’t include Tornadoes in our plot.
head(arrange(df_fatal,desc(Fatalities)))
## Event Fatalities
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
We make our plot excluding Tornadoes. We see that the events that cause the most fatalities are excessive heat/heat, flash flood/flood, lightning and TSTM winds.
df_fatal %>% ggplot(aes(x=Event,y=Fatalities))+geom_segment(aes(xend=Event,yend=0))+geom_point(size=2,color="red")+coord_flip()+theme_bw()+xlab("Event")+ggtitle("Weather Events resulting in fatality") +ylim(c(0,2000))+theme(text = element_text(size=8),
axis.text.x = element_text(angle=90, hjust=1))
## Warning: Removed 1 rows containing missing values (geom_segment).
## Warning: Removed 1 rows containing missing values (geom_point).
We repeat the same process to get results on total number of injuries, as we can see there is a similar positive skew.
totalInjur <- tapply(harmful_stormData$INJURIES,harmful_stormData$EVTYPE,sum)
summary(totalInjur)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 2.00 638.76 35.25 91346.00
percentInjur <- 100*totalInjur/sum(totalInjur)
print(head(percentInjur,15))
## AVALANCE AVALANCHE BLACK ICE BLIZZARD blowing snow BLOWING SNOW
## 0.000000000 0.120972333 0.017078447 0.572839576 0.000711602 0.009250825
## BRUSH FIRE COASTAL FLOOD Coastal Flooding COASTAL FLOODING COASTAL FLOODING/EROSION Coastal Storm
## 0.001423204 0.001423204 0.000000000 0.000000000 0.003558010 0.000711602
## COASTAL STORM COASTALSTORM Cold
## 0.000711602 0.000000000 0.000000000
This time our subset accounts for over 98% of all injuires.
MajorityInjur_index<- percentInjur[which(percentInjur>0.1)]
sum(MajorityInjur_index)
## [1] 98.53552
totalInjur <- totalInjur[which(percentInjur>0.1)]
We create our dataframe in the same way.
df_injur <- data.frame(Event = names(totalInjur),Injuries=totalInjur)
We see again Tornadoes cause the most injuries
head(arrange(df_injur,desc(Injuries)))
## Event Injuries
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
Plot excluding Tornadoes. Events causing the most injury are Excessive heat, flood, lightning and TSTM wind.
df_injur %>% ggplot(aes(x=Event,y=Injuries))+geom_segment(aes(xend=Event,yend=0))+geom_point(size=2,color="blue")+coord_flip()+theme_bw()+xlab("Event")+ggtitle("Total Injuries from Weather Events") +ylim(c(0,7000))+theme(text = element_text(size=8), axis.text.x = element_text(angle=90, hjust=1))
## Warning: Removed 1 rows containing missing values (geom_segment).
## Warning: Removed 1 rows containing missing values (geom_point).
Lastly we want to include information about these events that was lost when we subset the original dataset. We now know how many total injuries and deaths were caused by these most harmful types of weather events, however if a certain weather event is far more common than another this could be why the total injuries/fatalities are much higher. To investigate this first create a dataframe with both the dataframes used for our plots. We then subset this dataframe to only include events in which the total fatalities and total injuries sum to more than the sum of the third quartiles of the respective variables. We call these the worst events.
mrg <- merge(df_fatal,df_injur,by="Event")
WorstEvents <- subset(mrg,mrg$Fatalities+mrg$Injuries>(quantile(mrg$Fatalities)[[4]]+quantile(mrg$Injuries)[[4]]))$Event
print(WorstEvents)
## [1] "EXCESSIVE HEAT" "FLASH FLOOD" "FLOOD" "HEAT" "ICE STORM" "LIGHTNING" "TORNADO" "TSTM WIND"
Now we count all the total observations of our worst events that results in fatality or injury.
No.harmfulEvents<- sapply(split(harmful_stormData,harmful_stormData$EVTYPE),nrow)[WorstEvents]
And count all worst events recorded
No.Events<-sapply(split(stormData,stormData$EVTYPE),nrow)[WorstEvents]
Now we have the percent of our worst events that resulted in a fatality or injury.
(No.harmfulEvents/No.Events)*100
## EXCESSIVE HEAT FLASH FLOOD FLOOD HEAT ICE STORM LIGHTNING TORNADO TSTM WIND
## 40.405244 1.715275 1.618890 27.249022 4.735793 20.978799 13.071292 1.332182
This is useful as we can see while Tornadoes overall have caused the most harmful outcomes to human health, a smaller percent of them result in death/injury. As we can see it takes far less events of excessive heat and lightning to result in more injury or death.
To answer this question we are interested in variables indicating Property damage and crop damage. We check The number of missing values in each.
colSums(sapply(stormData[,c("PROPDMG","CROPDMG")],is.na))
## PROPDMG CROPDMG
## 0 0
We then subset the data to include property damage of at least 1 or crop damage of at least 1 (units of USD). Also we select only variables we are interested in.
ecoDmg_data <- subset(stormData,stormData$PROPDMG!=0 | stormData$CROPDMG!=0)
ecoDmg_data <- select(ecoDmg_data,"EVTYPE","PROPDMG","CROPDMG")
We calculate the total Property damage by weather see that there is a large positive skew
TotalPropDmg <- tapply(ecoDmg_data$PROPDMG,ecoDmg_data$EVTYPE,sum)
summary(TotalPropDmg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 5 50 25254 501 3212258
We subset the Total property damage by taking only the top 25% of the data as we are only interested in the events causing the most economic damage.
ThresholdDmg <- which(TotalPropDmg>quantile(TotalPropDmg)[[4]])
TotalPropDmg <- TotalPropDmg[ThresholdDmg]
We do the same for crop damage.
TotalCropDmg <- tapply(ecoDmg_data$CROPDMG,ecoDmg_data$EVTYPE,sum)
summary(TotalCropDmg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 3196.8 8.6 579596.3
ThresholdDmg_crop <- which(TotalCropDmg>quantile(TotalCropDmg)[[4]])
TotalCropDmg <- TotalCropDmg[ThresholdDmg_crop]
We take the intersection of the top 25% of the total damage each case and create a dataframe with only events from this intersection.
both <- intersect(names(TotalCropDmg),names(TotalPropDmg))
ecoDmg_data <- data.frame(Events = both,PropertyDMG = TotalPropDmg[both],CropDMG=TotalCropDmg[both])
We melt the data into a tall skinny dataframe with in preparation for a faceted plot.
ecoPlot_data<- melt(ecoDmg_data,id=c("Events"))
Plot the total damages with respect to property and crop.
ggplot(ecoPlot_data,aes(fill=variable,x=Events,y=value))+geom_bar(position = "stack",stat = "identity") +coord_flip() + ylab("Total Damages (USD)") + ggtitle("Total Damages for Weather Events") +theme(text = element_text(size=8), axis.text.x = element_text(angle=90, hjust=1))
We see that Tornado, flood and TSTM winds are all big contributors as when investigating impact on human health. Hail is an event that causes economic damge but not as much damage to health.