Air pollutants come from many sources and are not always visible to the naked eye. According to the United States Environmental Protection Agency (EPA), air pollution is a combination of gases and particles in the atmosphere that can reach harmful concentrations and negatively affect the health of people, animals, and the environment. Considerable air pollution prevention and control measures have been put in place over the years. The focus of this data study is to follow the trend of pollution in the United States over a 15-year period to determine whether there is a linear trend in NO2 mean levels in the U.S. and whether there are significant differences in NO2 mean levels among U.S. regions.
TEST 1:
Null Hypothesis: There is not a linear trend in NO2 mean levels in the U.S. between the years 2000 to 2015.
Alternative Hypothesis: There is a linear trend in NO2 mean levels in the U.S. between the years 2000 to 2015.
TEST 2:
Null Hypothesis: There is not a significant difference in NO2 mean levels among U.S. regions between the years 2000 to 2015.
Alternative Hypothesis: There is a significant difference in NO2 mean levels among U.S. regions between the years 2000 to 2015.
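Both tests can be sketched with scipy before touching the real data. The numbers below are synthetic and made up purely for illustration; the actual analysis uses the EPA dataset loaded later in the notebook.

```python
import numpy as np
from scipy.stats import linregress, f_oneway

rng = np.random.default_rng(0)

# Test 1 sketch: regress a synthetic yearly NO2 mean on year.
years = np.arange(2000, 2016)
no2_means = 18 - 0.5 * (years - 2000) + rng.normal(0, 0.3, years.size)
result = linregress(years, no2_means)
trend = result.pvalue < 0.05  # reject H0 (no linear trend) if p < alpha

# Test 2 sketch: one-way ANOVA across four synthetic regional samples.
northeast = rng.normal(16, 1, 16)
midwest = rng.normal(12, 1, 16)
south = rng.normal(11, 1, 16)
west = rng.normal(12, 1, 16)
stat, pvalue = f_oneway(northeast, midwest, south, west)
regions_differ = pvalue < 0.05  # reject H0 (equal regional means) if p < alpha
```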
'''
Requirements:
1) Unzip data file from https://data.world/data-society/us-air-pollution-data
2) Google API key gkey in config.py
'''
# import modules
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import json
import gmaps
from config import gkey
from pprint import pprint
from scipy import stats
from scipy.stats import linregress
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
The dataset that we analyzed was sourced from data.world (https://data.world/data-society/us-air-pollution-data). With a file size of 411.49 MB, the dataset contains 1,746,661 rows and 28 columns of air pollution data documented daily by the U.S. EPA from 2000 to 2016. The dataset covers four pollutants: Nitrogen Dioxide (NO2), Sulfur Dioxide (SO2), Carbon Monoxide (CO), and Ozone (O3). For each pollutant it provides five columns: units measured, mean concentration, air quality index, maximum concentration in a given day, and the hour when that maximum was recorded. In addition, the dataset includes the state, city, address, and date of monitoring.
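A 411 MB CSV fits in memory on most machines, but pandas can also stream it in chunks if memory is tight. A minimal sketch, using a tiny in-memory CSV as a stand-in for the real file (the actual file path appears in the next cell):

```python
import io
import pandas as pd

# Stand-in for the 411 MB pollution CSV.
csv_text = "State,NO2 Mean\nNew York,20.0\nOhio,14.0\nTexas,12.0\n"

# chunksize makes read_csv return an iterator of DataFrames
# instead of one large frame, bounding peak memory use.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)
total_rows = sum(len(chunk) for chunk in chunks)
```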
# Import Data (Smaller test data to start with)
file_path = os.path.join('..', 'Project1', 'data-society-us-air-pollution-data', 'uspollution', 'pollution_us_2000_2016.csv')
raw_data = pd.read_csv(file_path)
# Examine column data types
display(raw_data.head())
display(raw_data.info())
The U.S. Air Pollution data was generally clean and did not require much manipulation. We removed “Country Of Mexico” and all related data, as our study focused solely on the United States. We also decided to focus on one pollutant in the interest of time. We chose Nitrogen Dioxide because it is the most harmful to humans. High concentrations of Nitrogen Dioxide in the atmosphere can irritate the airways in the human respiratory system, causing coughing, wheezing, or difficulty breathing. Prolonged exposure to Nitrogen Dioxide can make the lungs more susceptible to infection, asthma, and respiratory disease. In addition, Nitrogen Dioxide can mix with other gases in the atmosphere to produce haze and acid rain and can contribute to nutrient pollution in coastal waters (epa.gov).
Data was grouped in two ways, one for each hypothesis. First, data was grouped by year to complete the year-over-year analysis. Next, data was grouped by region to complete the regional analysis. States were grouped into regions based on the regional chart located on the U.S. Embassy website. The four regions were: West, Midwest, South, and Northeast.
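The two groupings can be sketched on toy data (the values below are hypothetical; the actual grouping code appears in the analysis cells later in the notebook):

```python
import pandas as pd

# Toy frame standing in for the cleaned pollution data.
toy = pd.DataFrame({
    'State': ['New York', 'Ohio', 'Texas', 'California'],
    'Region': ['Northeast', 'Midwest', 'South', 'West'],
    'year': [2000, 2000, 2001, 2001],
    'NO2 Mean': [20.0, 14.0, 12.0, 16.0],
})

# Grouping 1: year-over-year mean NO2 levels.
by_year = toy.groupby('year')['NO2 Mean'].mean()

# Grouping 2: regional mean NO2 levels.
by_region = toy.groupby('Region')['NO2 Mean'].mean()
```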
# Examine the shape of the dataset
data_dim = raw_data.shape
num_rows = data_dim[0]
num_cols = data_dim[1]
print(f"This air pollution dataset has {num_rows} rows of data, and each row has {num_cols} columns.")
# How Many Unique States?
state_uni = raw_data["State"].unique()
print(f"There are {len(state_uni)} states in the dataset.")
# We notice some states are missing. Identify missing states.
'''source: https://gist.github.com/rugbyprof/76575b470b6772ce8fa0c49e23931d97'''
state_list = {"AL":"Alabama" ,"AK":"Alaska" ,"AZ":"Arizona" ,"AR":"Arkansas" ,"CA":"California",
"CO":"Colorado" ,"CT":"Connecticut","DE":"Delaware" ,"FL":"Florida" ,"GA":"Georgia" ,
"HI":"Hawaii" ,"ID":"Idaho" ,"IL":"Illinois" ,"IN":"Indiana" ,"IA":"Iowa" ,
"KS":"Kansas" ,"KY":"Kentucky" ,"LA":"Louisiana" ,"ME":"Maine" ,"MD":"Maryland" ,
"MA":"Massachusetts","MI":"Michigan" ,"MN":"Minnesota" ,"MS":"Mississippi" ,"MO":"Missouri" ,
"MT":"Montana" ,"NE":"Nebraska" ,"NV":"Nevada" ,"NH":"New Hampshire","NJ":"New Jersey",
"NM":"New Mexico" ,"NY":"New York" ,"NC":"North Carolina","ND":"North Dakota" ,"OH":"Ohio" ,
"OK":"Oklahoma" ,"OR":"Oregon" ,"PA":"Pennsylvania" ,"RI":"Rhode Island" ,"SC":"South Carolina",
"SD":"South Dakota" ,"TN":"Tennessee" ,"TX":"Texas" ,"UT":"Utah" ,"VT":"Vermont",
"VA":"Virginia" ,"WA":"Washington" ,"WV":"West Virginia" ,"WI":"Wisconsin" ,"WY":"Wyoming",
"DC":"District Of Columbia",
}
missing_states = []
for itm in state_list:
    if state_list[itm] not in state_uni:
        missing_states.append(state_list[itm])
print('Missing states: ', missing_states)
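The same missing-state check can be written as a set difference. A sketch with toy sets standing in for `set(state_list.values())` and `set(state_uni)`:

```python
# Toy stand-ins for the full state-name set and the observed states.
full_names = {'Alabama', 'Alaska', 'Arizona'}
observed = {'Alabama', 'Arizona'}

# Set difference yields the names present in the reference list
# but absent from the dataset; sorted() gives a stable order.
missing = sorted(full_names - observed)
```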
# The unique states in the dataset plus the missing states add up to 52, one more than the expected 51 (50 states plus D.C.). Something is not right.
pprint(np.sort(state_uni))
# Drop Country of Mexico
trim_data = raw_data.loc[raw_data['State'] != 'Country Of Mexico']
# Keep only NO2 Mean Data
col_list = ['State', 'Date Local', 'NO2 Mean']
trim_data = trim_data[col_list]
# Examine whether we have any NaN in the data
for col in col_list:
    print(col, ': ', trim_data.loc[trim_data[col].isna()].shape[0])
# Drop any duplicate rows
trim_data = trim_data.drop_duplicates(subset = col_list)
# Add Year Column
trim_data['year'] = trim_data['Date Local'].str[:4].astype(int)
# Add Month Column
trim_data['month'] = trim_data['Date Local'].str[5:7].astype(int)
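String slicing works because the `Date Local` column is ISO-formatted (YYYY-MM-DD). A more robust alternative is `pd.to_datetime`, sketched here on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical frame with the same date format as the real column.
df = pd.DataFrame({'Date Local': ['2000-01-15', '2015-12-31']})

# Parse once, then pull year and month from the datetime accessor.
dates = pd.to_datetime(df['Date Local'])
df['year'] = dates.dt.year
df['month'] = dates.dt.month
```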
# Examine year month range
trim_data_group = trim_data.groupby(['year', 'month']).count()
trim_data_group = trim_data_group.reset_index(drop = False)
## Sort by Year and then month
trim_data_group = trim_data_group.sort_values(['year', 'month'])
# Visualize any missing months
year_month = trim_data_group[['year', 'month']]
year_month_group = year_month.groupby('year').count()
year_month_group = year_month_group.reset_index(drop = False)
year_month_group = year_month_group.rename(columns = {'month' : 'month_count'})
year_month_group.plot(kind = 'bar', x = 'year', y = 'month_count', color = 'darkorange', legend=None);
## Add chart details
plt.title('Month Count per Year');
plt.ylabel('Month Count');
# Drop 2016 data as it is only a partial year in the dataset
trim_data = trim_data.loc[trim_data['year'] != 2016].copy()
# Create a Tidy Dataset
tidy_data = trim_data.copy()
# Save Tidy Data
tidy_data.to_csv('air_pollution_tidy.csv', index = False)
# Read Tidy Data back
tidy_data = pd.read_csv('air_pollution_tidy.csv')
tidy_data.tail()
Analysis was done in two phases. The data was first analyzed in accordance with hypothesis 1 (year-over-year). The data grouped by year was plotted to create a visual display. Based on the year-over-year plot, there appears to be a trend in the data; however, a definitive conclusion could not be drawn from the plot alone. Next, a regression analysis was performed. The regression revealed a linear trend with a negative correlation, meaning that NO2 pollution is decreasing each year.
# Year-over-year NO2 level Analysis
# Group data by years
group_data = tidy_data.groupby('year').mean(numeric_only=True)
## Reset index
group_data = group_data.reset_index(drop = False)
# Use .arange to make x values for the following charts/graphs
x_values = np.arange(group_data['year'].min(), group_data['year'].max() + 1, 1)
y_values = group_data['NO2 Mean']
# Set up plot size
plt.figure(figsize = (9, 9), dpi = 100)
plt.axes(facecolor = 'lightgrey')
# Line graph for NO2 Mean Values
plt.plot(x_values, y_values, linewidth = 4, label = 'NO2 Mean Values', color = 'darkorange')
plt.title('US NO2 Mean Values Year over Year')
plt.xlabel('Year')
plt.ylabel('NO2 Level (parts per billion)')
plt.legend(loc = 'best')
plt.xticks(ticks = x_values, rotation = 45)
plt.savefig('./us_no2_plot.png')
plt.show()
# Linear regression analysis for US NO2 Level
plt.figure(figsize = (9, 9), dpi = 100)
plt.scatter(x_values, y_values, color = 'darkorange')
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.plot(x_values, regress_values, "--", color = 'black')
plt.annotate(line_eq, (2008, 13.5), fontsize = 15, color = "black")
plt.title('US NO2 Mean Values Year over Year')
plt.xlabel('Year')
plt.ylabel('NO2 Values (parts per billion)')
print(f"The r-squared is: {rvalue**2}")
print(f'The p-value is {pvalue}')
plt.xticks(ticks = x_values, rotation = 45)
plt.savefig('./us_no2_plot_linregress.png')
plt.show()
Based on the analysis, we found that there is a downward linear trend in pollution year over year. NO2 pollution appears to be decreasing overall.
Finally, the data was analyzed in accordance with hypothesis 2 (regional). The regional analysis used an ANOVA test followed by pairwise t-tests. The ANOVA test returned a p-value below the 0.05 significance level, so we rejected the null hypothesis that all four regional means are equal: at least one region differs from the rest. Because the grouped data did not appear to be normally distributed with equal variances, we did not follow up with a Tukey test. Instead, we compared the regions in pairs: Northeast vs Midwest, Northeast vs West, Northeast vs South, Midwest vs South, Midwest vs West, and West vs South. Welch's two-sample t-test was used to compare each pair's mean values, and the pairwise results revealed that the Northeast region differs from every other region.
# Regional NO2 Mean Values Analysis
# Define each state's region
region_dict = {
'Alabama':'South', 'Alaska':'West', 'Arizona':'West', 'Arkansas':'South', 'California':'West',
'Colorado':'West', 'Connecticut':'Northeast', 'Delaware':'Northeast', 'District Of Columbia' : 'Northeast', 'Florida':'South',
'Georgia':'South', 'Hawaii':'West', 'Idaho':'West', 'Illinois':'Midwest', 'Indiana':'Midwest', 'Iowa':'Midwest',
'Kansas':'Midwest', 'Kentucky':'South', 'Louisiana':'South', 'Maine':'Northeast', 'Maryland':'South',
'Massachusetts':'Northeast', 'Michigan':'Midwest', 'Minnesota':'Midwest', 'Mississippi':'South', 'Missouri':'Midwest',
'Montana':'West', 'Nebraska':'Midwest', 'Nevada':'West', 'New Hampshire':'Northeast', 'New Jersey':'Northeast',
'New Mexico':'West', 'New York':'Northeast', 'North Carolina':'South', 'North Dakota':'Midwest', 'Ohio':'Midwest',
'Oklahoma':'South', 'Oregon':'West', 'Pennsylvania':'Northeast', 'Rhode Island':'Northeast', 'South Carolina':'South',
'South Dakota':'Midwest','Tennessee':'South','Texas':'South','Utah':'West','Vermont':'Northeast',
'Virginia':'South', 'Washington':'West', 'West Virginia':'South','Wisconsin':'Midwest','Wyoming':'West',
}
# Map Region into the dataframe
data_no2 = tidy_data.copy()
data_no2['Region'] = data_no2['State'].map(region_dict)
data_no2.head()
# Create a DataFrame group by region and then year
grouped_region_yr = data_no2.groupby(['Region', 'year']).mean(numeric_only=True)
grouped_region_yr = grouped_region_yr.reset_index(drop = False)
grouped_region_yr.head()
# Visualization
northeast = grouped_region_yr.loc[grouped_region_yr['Region'] == 'Northeast'].copy()
midwest = grouped_region_yr.loc[grouped_region_yr['Region'] == 'Midwest'].copy()
south = grouped_region_yr.loc[grouped_region_yr['Region'] == 'South'].copy()
west = grouped_region_yr.loc[grouped_region_yr['Region'] == 'West'].copy()
# Plotting the regional data
plt.figure(figsize = (9, 9), dpi = 100)
plt.axes(facecolor = 'lightgrey')
## Northeast
x = northeast['year']
y = northeast['NO2 Mean']
plt.plot(x, y, color = 'darkorange', linewidth = 3, label = 'Northeast')
## West
x1 = west['year']
y1 = west['NO2 Mean']
plt.plot(x1, y1, color = 'b', linewidth = 3, label = 'West')
## Midwest
x2 = midwest['year']
y2 = midwest['NO2 Mean']
plt.plot(x2, y2, color = 'teal', linewidth = 3, label = 'Midwest')
## South
x3 = south['year']
y3 = south['NO2 Mean']
plt.plot(x3, y3, color = 'crimson', linewidth = 3, label = 'South')
# Adding Plot Details
plt.title('Mean NO2 Level by Year')
plt.xlabel('Year')
plt.ylabel('NO2 Level (Parts per billion)')
plt.xticks(ticks = x, rotation = 45)
plt.legend();
# Box plot visualization
plt.figure(figsize = (20, 10))
plt.axes(facecolor = 'lightgrey')
sns.boxplot(data = grouped_region_yr, x = 'Region', y = 'NO2 Mean',
palette = ['darkorange', 'b', 'teal', 'crimson']);
sns.stripplot(data = grouped_region_yr, x = 'Region', y = 'NO2 Mean', color = 'white', jitter = .2, size = 10)
plt.ylabel('NO2 Level (Parts per billion)')
plt.title('Mean NO2 Level by Year', size = 20);
# ANOVA Test
# Define alpha
alpha = 0.05
'''
The one-way ANOVA tests the null hypothesis that two or more groups have the same population mean.
The test is applied to samples from two or more groups, possibly with differing sizes.
'''
# One-way Anova test
statistic, pvalue = stats.f_oneway(northeast["NO2 Mean"], midwest["NO2 Mean"], south["NO2 Mean"], west["NO2 Mean"])
# Result printout
print('p-value = ', round(pvalue, 3))
if pvalue < alpha:
    print('Reject null hypothesis.')
else:
    print('Fail to reject null hypothesis.')
# Two-Sample Student T-Test
# Define alpha
alpha = 0.05
'''
This is a two-sided test for the null hypothesis that two independent samples have identical average (expected) values.
By default the test assumes identical population variances; we pass equal_var=False to run Welch's t-test, which does not.
'''
region_pair= ['Northeast vs Midwest', 'Northeast vs West', 'Northeast vs South',
'Midwest vs South' , 'Midwest vs West' ,
'West vs South' ,
]
ttest_stat = []
ttest_pval = []
# Comparing Northeast vs Midwest
# Calculate Independent (Two Sample) T-Test
statistic, pvalue = stats.ttest_ind(northeast["NO2 Mean"], midwest["NO2 Mean"], equal_var=False)
ttest_stat.append(round(statistic, 2))
ttest_pval.append(round(pvalue, 2))
# Comparing Northeast vs West
# Calculate Independent (Two Sample) T-Test
statistic, pvalue = stats.ttest_ind(northeast["NO2 Mean"], west["NO2 Mean"], equal_var=False)
ttest_stat.append(round(statistic, 2))
ttest_pval.append(round(pvalue, 2))
# Comparing Northeast vs South
# Calculate Independent (Two Sample) T-Test
statistic, pvalue = stats.ttest_ind(northeast["NO2 Mean"], south["NO2 Mean"], equal_var=False)
ttest_stat.append(round(statistic, 2))
ttest_pval.append(round(pvalue, 2))
# Comparing Midwest vs South
# Calculate Independent (Two Sample) T-Test
statistic, pvalue = stats.ttest_ind(midwest["NO2 Mean"], south["NO2 Mean"], equal_var=False)
ttest_stat.append(round(statistic, 2))
ttest_pval.append(round(pvalue, 2))
# Comparing Midwest vs West
# Calculate Independent (Two Sample) T-Test
statistic, pvalue = stats.ttest_ind(midwest["NO2 Mean"], west["NO2 Mean"], equal_var=False)
ttest_stat.append(round(statistic, 2))
ttest_pval.append(round(pvalue, 2))
# Compare West vs South
# Calculate Independent (Two Sample) T-Test
statistic, pvalue = stats.ttest_ind(west["NO2 Mean"], south["NO2 Mean"], equal_var=False)
ttest_stat.append(round(statistic, 2))
ttest_pval.append(round(pvalue, 2))
region_pair_dict = {
'Region Pair' : region_pair,
'Statistics' : ttest_stat,
'P_value' : ttest_pval,
}
region_pair_df = pd.DataFrame(region_pair_dict)
region_pair_df = region_pair_df.sort_values('P_value')
region_pair_df = region_pair_df.reset_index( drop = True)
region_pair_df
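The six pairwise cells above can also be generated with a single loop over `itertools.combinations`. A sketch on synthetic regional samples (the values are made up; the notebook uses the real regional DataFrames):

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins for the four regional NO2 samples.
groups = {
    'Northeast': rng.normal(16, 1, 16),
    'Midwest': rng.normal(12, 1, 16),
    'South': rng.normal(11, 1, 16),
    'West': rng.normal(12, 1, 16),
}

# One Welch's t-test per unordered pair of regions.
rows = []
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    stat, p = stats.ttest_ind(a, b, equal_var=False)
    rows.append((f'{name_a} vs {name_b}', round(stat, 2), round(p, 4)))
```

The loop removes the six copy-pasted cells and guarantees every pair is covered exactly once.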
Visually, the South region appears different from the Northeast and West regions. The t-test results confirm that, at a significance level of 0.05, we can statistically reject the null hypothesis: the regions do not share the same average NO2 level.
# Pollution visualization on a heatmap
grouped_state_year = tidy_data.groupby(['State', 'year']).mean(numeric_only=True)
grouped_state_year = grouped_state_year.reset_index()
year = 2015
data_year = grouped_state_year.loc[grouped_state_year['year'] == year].copy()
state = grouped_state_year['State'].unique()
# Call and lookup each state's coordinates
lat_list = []
lng_list = []
for itm in state:
    # Build the endpoint URL
    target_url = ('https://maps.googleapis.com/maps/api/geocode/json?'
                  'address={0}&key={1}').format(itm + ', USA', gkey)
    # Run a request to the endpoint and convert the result to JSON
    geo_data = requests.get(target_url).json()
    # Find each state's geolocation
    lat = geo_data['results'][0]['geometry']['location']['lat']
    lng = geo_data['results'][0]['geometry']['location']['lng']
    # Save latitude and longitude into lists
    lat_list.append(lat)
    lng_list.append(lng)
state_coor_list = {
'State' : state,
'lat' : lat_list,
'lng' : lng_list,
}
state_coor = pd.DataFrame(state_coor_list)
state_coor.to_csv('state_coordinates.csv', index = False)
merge_coor_data = pd.merge(state_coor, data_year, on = 'State', how = 'left')
# Configure gmaps
gmaps.configure(api_key=gkey)
locations = merge_coor_data[['lat', 'lng']].copy()
weights = merge_coor_data['NO2 Mean']
# Plot Heatmap
fig = gmaps.figure(zoom_level = 3, center = (40, -75))
# Create heat layer
heat_layer = gmaps.heatmap_layer(locations, weights=weights,
dissipating = False, max_intensity=15,
point_radius=3)
# Add layer
fig.add_layer(heat_layer)
# Display figure
fig
Overall, we see a general decreasing trend in air pollution as well as regional differences. There are some limitations to this analysis: first, the data did not include observations for the states of Mississippi, Montana, Nebraska, and Vermont; second, the downward trend cannot continue past 0.
Works Cited
“Nitrogen Dioxide (NO2) Pollution.” EPA, Environmental Protection Agency, 13 June 2019, www.epa.gov/no2-pollution.
“Nitrogen Dioxide (NO2).” Department of Agriculture, Water and the Environment, www.environment.gov.au/protection/publications/factsheet-nitrogen-dioxide-no2.
“The Regions of the United States.” U.S. Embassy & Consulate in the Republic of Korea, https://kr.usembassy.gov/education-culture/infopedia-usa/travel-usa/regions-united-states/.