Air pollutants come from many sources and are not always visible to the naked eye. According to the United States Environmental Protection Agency (EPA), air pollution is a combination of gases and particles in the atmosphere that can reach harmful concentrations and negatively affect the health of people, animals, and the environment. Considerable air pollution prevention and control measures have been put in place over the years. The focus of this data study is to follow the trend of pollution in the United States over a 15-year period to determine whether there is a linear trend in NO2 mean levels in the U.S. and whether there are significant differences in NO2 mean levels among U.S. regions.
TEST 1:
Null Hypothesis: There is not a linear trend in NO2 mean levels in the U.S. between the years 2000 to 2015.
Alternative Hypothesis: There is a linear trend in NO2 mean levels in the U.S. between the years 2000 to 2015.
TEST 2:
Null Hypothesis: There is not a significant difference in NO2 mean levels among U.S. regions between the years 2000 to 2015.
Alternative Hypothesis: There is a significant difference in NO2 mean levels among U.S. regions between the years 2000 to 2015.
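Both tests can be sketched with scipy before touching the real data. The numbers below are synthetic and made up purely for illustration; the actual analysis uses the EPA dataset loaded later in the notebook.

```python
import numpy as np
from scipy.stats import linregress, f_oneway

rng = np.random.default_rng(0)

# Test 1 sketch: regress a synthetic yearly NO2 mean on year.
years = np.arange(2000, 2016)
no2_means = 18 - 0.5 * (years - 2000) + rng.normal(0, 0.3, years.size)
result = linregress(years, no2_means)
trend = result.pvalue < 0.05  # reject H0 (no linear trend) if p < alpha

# Test 2 sketch: one-way ANOVA across four synthetic regional samples.
northeast = rng.normal(16, 1, 16)
midwest = rng.normal(12, 1, 16)
south = rng.normal(11, 1, 16)
west = rng.normal(12, 1, 16)
stat, pvalue = f_oneway(northeast, midwest, south, west)
regions_differ = pvalue < 0.05  # reject H0 (equal regional means) if p < alpha
```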
'''
Requirements:
1) Unzip data file from https://data.world/data-society/us-air-pollution-data
2) Google API key gkey in config.py
'''
# import modules
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import json
import gmaps
from config import gkey
from pprint import pprint
from scipy import stats
from scipy.stats import linregress
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
The dataset that we analyzed was sourced from data.world (https://data.world/data-society/us-air-pollution-data). With a file size of 411.49 MB, the dataset contains 1,746,661 rows and 28 columns of air pollution data documented daily by the U.S. EPA from 2000 to 2016. The dataset covers four pollutants: Nitrogen Dioxide (NO2), Sulfur Dioxide (SO2), Carbon Monoxide (CO), and Ozone (O3). For each pollutant it provides five columns: units measured, mean concentration, air quality index, maximum concentration in a given day, and the hour when that maximum was recorded. In addition, the dataset includes the state, city, address, and date of monitoring.
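A 411 MB CSV fits in memory on most machines, but pandas can also stream it in chunks if memory is tight. A minimal sketch, using a tiny in-memory CSV as a stand-in for the real file (the actual file path appears in the next cell):

```python
import io
import pandas as pd

# Stand-in for the 411 MB pollution CSV.
csv_text = "State,NO2 Mean\nNew York,20.0\nOhio,14.0\nTexas,12.0\n"

# chunksize makes read_csv return an iterator of DataFrames
# instead of one large frame, bounding peak memory use.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)
total_rows = sum(len(chunk) for chunk in chunks)
```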
# Import Data (Smaller test data to start with)
file_path = os.path.join('..', 'Project1', 'data-society-us-air-pollution-data', 'uspollution', 'pollution_us_2000_2016.csv')
raw_data = pd.read_csv(file_path)
# Examine column data types
display(raw_data.head())
display(raw_data.info())
The U.S. Air Pollution data was generally clean and did not require much manipulation. We removed “Country Of Mexico” and all related data, as our study focused solely on the United States. We also decided to focus on one pollutant in the interest of time. We chose Nitrogen Dioxide because it is the most harmful to humans. High concentrations of Nitrogen Dioxide in the atmosphere can irritate the airways in the human respiratory system, causing coughing, wheezing, or difficulty breathing. Prolonged exposure to Nitrogen Dioxide can make the lungs more susceptible to infection, asthma, and respiratory disease. In addition, Nitrogen Dioxide can mix with other gases in the atmosphere to produce haze and acid rain and can contribute to nutrient pollution in coastal waters (epa.gov).
Data was grouped in two ways, one for each hypothesis. First, data was grouped by year to complete the year-over-year analysis. Next, data was grouped by region to complete the regional analysis. States were grouped into regions based on the regional chart located on the U.S. Embassy website. The four regions were: West, Midwest, South, and Northeast.
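The two groupings can be sketched on toy data (the values below are hypothetical; the actual grouping code appears in the analysis cells later in the notebook):

```python
import pandas as pd

# Toy frame standing in for the cleaned pollution data.
toy = pd.DataFrame({
    'State': ['New York', 'Ohio', 'Texas', 'California'],
    'Region': ['Northeast', 'Midwest', 'South', 'West'],
    'year': [2000, 2000, 2001, 2001],
    'NO2 Mean': [20.0, 14.0, 12.0, 16.0],
})

# Grouping 1: year-over-year mean NO2 levels.
by_year = toy.groupby('year')['NO2 Mean'].mean()

# Grouping 2: regional mean NO2 levels.
by_region = toy.groupby('Region')['NO2 Mean'].mean()
```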
# Examine the shape of the dataset
data_dim = raw_data.shape
num_rows = data_dim[0]
num_cols = data_dim[1]
print(f"This air pollution dataset has {num_rows} rows of data, and each row has {num_cols} columns.")
# How Many Unique States?
state_uni = raw_data["State"].unique()
print(f"There are {len(state_uni)} states in the dataset.")
# We notice some states are missing. Identify missing states.
'''source: https://gist.github.com/rugbyprof/76575b470b6772ce8fa0c49e23931d97'''
state_list = {"AL":"Alabama" ,"AK":"Alaska" ,"AZ":"Arizona" ,"AR":"Arkansas" ,"CA":"California",
"CO":"Colorado" ,"CT":"Connecticut","DE":"Delaware" ,"FL":"Florida" ,"GA":"Georgia" ,
"HI":"Hawaii" ,"ID":"Idaho" ,"IL":"Illinois" ,"IN":"Indiana" ,"IA":"Iowa" ,
"KS":"Kansas" ,"KY":"Kentucky" ,"LA":"Louisiana" ,"ME":"Maine" ,"MD":"Maryland" ,
"MA":"Massachusetts","MI":"Michigan" ,"MN":"Minnesota" ,"MS":"Mississippi" ,"MO":"Missouri" ,
"MT":"Montana" ,"NE":"Nebraska" ,"NV":"Nevada" ,"NH":"New Hampshire","NJ":"New Jersey",
"NM":"New Mexico" ,"NY":"New York" ,"NC":"North Carolina","ND":"North Dakota" ,"OH":"Ohio" ,
"OK":"Oklahoma" ,"OR":"Oregon" ,"PA":"Pennsylvania" ,"RI":"Rhode Island" ,"SC":"South Carolina",
"SD":"South Dakota" ,"TN":"Tennessee" ,"TX":"Texas" ,"UT":"Utah" ,"VT":"Vermont",
"VA":"Virginia" ,"WA":"Washington" ,"WV":"West Virginia" ,"WI":"Wisconsin" ,"WY":"Wyoming",
"DC":"District Of Columbia",
}
missing_states = []
for itm in state_list:
    if state_list[itm] not in state_uni:
        missing_states.append(state_list[itm])
print('Missing states: ', missing_states)
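The same missing-state check can be written as a set difference. A sketch with toy sets standing in for `set(state_list.values())` and `set(state_uni)`:

```python
# Toy stand-ins for the full state-name set and the observed states.
full_names = {'Alabama', 'Alaska', 'Arizona'}
observed = {'Alabama', 'Arizona'}

# Set difference yields the names present in the reference list
# but absent from the dataset; sorted() gives a stable order.
missing = sorted(full_names - observed)
```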
# The unique states in the dataset plus the missing states add up to 52, one more than the expected 51 (50 states plus D.C.). Something is not right.
pprint(np.sort(state_uni))
# Drop Country of Mexico
trim_data = raw_data.loc[raw_data['State'] != 'Country Of Mexico']
# Keep only NO2 Mean Data
col_list = ['State', 'Date Local', 'NO2 Mean']
trim_data = trim_data[col_list]
# Examine whether we have any NaN in the data
for col in col_list:
    print(col, ': ', trim_data.loc[trim_data[col].isna()].shape[0])
# Drop any duplicate rows
trim_data = trim_data.drop_duplicates(subset = col_list)
# Add Year Column
trim_data['year'] = trim_data['Date Local'].str[:4].astype(int)
# Add Month Column
trim_data['month'] = trim_data['Date Local'].str[5:7].astype(int)
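String slicing works because the `Date Local` column is ISO-formatted (YYYY-MM-DD). A more robust alternative is `pd.to_datetime`, sketched here on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical frame with the same date format as the real column.
df = pd.DataFrame({'Date Local': ['2000-01-15', '2015-12-31']})

# Parse once, then pull year and month from the datetime accessor.
dates = pd.to_datetime(df['Date Local'])
df['year'] = dates.dt.year
df['month'] = dates.dt.month
```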
# Examine year month range
trim_data_group = trim_data.groupby(['year', 'month']).count()
trim_data_group = trim_data_group.reset_index(drop = False)
## Sort by Year and then month
trim_data_group = trim_data_group.sort_values(['year', 'month'])
# Visualize any missing months
year_month = trim_data_group[['year', 'month']]
year_month_group = year_month.groupby('year').count()
year_month_group = year_month_group.reset_index(drop = False)
year_month_group = year_month_group.rename(columns = {'month' : 'month_count'})
year_month_group.plot(kind = 'bar', x = 'year', y = 'month_count', color = 'darkorange', legend=None);
## Add chart details
plt.title('Month Count per Year');
plt.ylabel('Month Count');
# Drop 2016 data as it is only a partial year in the dataset
trim_data = trim_data.loc[trim_data['year'] != 2016].copy()
# Create a Tidy Dataset
tidy_data = trim_data.copy()
# Save Tidy Data
tidy_data.to_csv('air_pollution_tidy.csv', index = False)
# Read Tidy Data back
tidy_data = pd.read_csv('air_pollution_tidy.csv')
tidy_data.tail()
Analysis was done in two phases. The data was first analyzed in accordance with hypothesis 1 (year-over-year). The data grouped by year was plotted to create a visual display. Based on the year-over-year plot, there appears to be a trend in the data; however, a definitive conclusion could not be drawn from the plot alone. Next, a regression analysis was performed. The regression revealed a linear trend with a negative correlation, meaning that NO2 pollution is decreasing each year.
# Year-over-year NO2 level Analysis
# Group data by years
group_data = tidy_data.groupby('year').mean(numeric_only=True)
## Reset index
group_data = group_data.reset_index(drop = False)
# Use .arange to make x values for the following charts/graphs
x_values = np.arange(group_data['year'].min(), group_data['year'].max() + 1, 1)
y_values = group_data['NO2 Mean']
# Set up plot size
plt.figure(figsize = (9, 9), dpi = 100)
plt.axes(facecolor = 'lightgrey')
# Line graph for NO2 Mean Values
plt.plot(x_values, y_values, linewidth = 4, label = 'NO2 Mean Values', color = 'darkorange')
plt.title('US NO2 Mean Values Year over Year')
plt.xlabel('Year')
plt.ylabel('NO2 Level (parts per billion)')
plt.legend(loc = 'best')
plt.xticks(ticks = x_values, rotation = 45)
plt.savefig('./us_no2_plot.png')
plt.show()
# Linear regression analysis for US NO2 Level
plt.figure(figsize = (9, 9), dpi = 100)
plt.scatter(x_values, y_values, color = 'darkorange')
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.plot(x_values, regress_values, "--", color = 'black')
plt.annotate(line_eq, (2008, 13.5), fontsize = 15, color = "black")
plt.title('US NO2 Mean Values Year over Year')
plt.xlabel('Year')
plt.ylabel('NO2 Values (parts per billion)')
print(f"The r-squared is: {rvalue**2}")
print(f'The p-value is {pvalue}')
plt.xticks(ticks = x_values, rotation = 45)
plt.savefig('./us_no2_plot_linregress.png')
plt.show()
Based on the analysis, we found that there is a downward linear trend in pollution year over year. NO2 pollution appears to be decreasing overall.
Finally, the data was analyzed in accordance with hypothesis 2 (regional). The regional analysis used an ANOVA test followed by pairwise t-tests. The ANOVA test returned a p-value below the 0.05 significance level, so we rejected the null hypothesis that all four regional means are equal: at least one region differs from the rest. Because the grouped data did not appear to be normally distributed with equal variances, we did not follow up with a Tukey test. Instead, we compared the regions in pairs: Northeast vs Midwest, Northeast vs West, Northeast vs South, Midwest vs South, Midwest vs West, and West vs South. Welch's two-sample t-test was used to compare each pair's mean values, and the pairwise results revealed that the Northeast region differs from every other region.
# Regional NO2 Mean Values Analysis
# Define each state's region
region_dict = {
'Alabama':'South', 'Alaska':'West', 'Arizona':'West', 'Arkansas':'South', 'California':'West',
'Colorado':'West', 'Connecticut':'Northeast', 'Delaware':'Northeast', 'District Of Columbia' : 'Northeast', 'Florida':'South',
'Georgia':'South', 'Hawaii':'West', 'Idaho':'West', 'Illinois':'Midwest', 'Indiana':'Midwest', 'Iowa':'Midwest',
'Kansas':'Midwest', 'Kentucky':'South', 'Louisiana':'South', 'Maine':'Northeast', 'Maryland':'South',
'Massachusetts':'Northeast', 'Michigan':'Midwest', 'Minnesota':'Midwest', 'Mississippi':'South', 'Missouri':'Midwest',
'Montana':'West', 'Nebraska':'Midwest', 'Nevada':'West', 'New Hampshire':'Northeast', 'New Jersey':'Northeast',
'New Mexico':'West', 'New York':'Northeast', 'North Carolina':'South', 'North Dakota':'Midwest', 'Ohio':'Midwest',
'Oklahoma':'South', 'Oregon':'West', 'Pennsylvania':'Northeast', 'Rhode Island':'Northeast', 'South Carolina':'South',
'South Dakota':'Midwest','Tennessee':'South','Texas':'South','Utah':'West','Vermont':'Northeast',
'Virginia':'South', 'Washington':'West', 'West Virginia':'South','Wisconsin':'Midwest','Wyoming':'West',
}
# Map Region into the dataframe
data_no2 = tidy_data.copy()
data_no2['Region'] = data_no2['State'].map(region_dict)
data_no2.head()
# Create a DataFrame group by region and then year
grouped_region_yr = data_no2.groupby(['Region', 'year']).mean(numeric_only=True)
grouped_region_yr = grouped_region_yr.reset_index(drop = False)
grouped_region_yr.head()
# Visualization
northeast = grouped_region_yr.loc[grouped_region_yr['Region'] == 'Northeast'].copy()
midwest = grouped_region_yr.loc[grouped_region_yr['Region'] == 'Midwest'].copy()
south = grouped_region_yr.loc[grouped_region_yr['Region'] == 'South'].copy()
west = grouped_region_yr.loc[grouped_region_yr['Region'] == 'West'].copy()
# Plotting the regional data
plt.figure(figsize = (9, 9), dpi = 100)
plt.axes(facecolor = 'lightgrey')
## Northeast
x = northeast['year']
y = northeast['NO2 Mean']
plt.plot(x, y, color = 'darkorange', linewidth = 3, label = 'Northeast')
## West
x1 = west['year']
y1 = west['NO2 Mean']
plt.plot(x1, y1, color = 'b', linewidth = 3, label = 'West')
## Midwest
x2 = midwest['year']
y2 = midwest['NO2 Mean']
plt.plot(x2, y2, color = 'teal', linewidth = 3, label = 'Midwest')
## South
x3 = south['year']
y3 = south['NO2 Mean']
plt.plot(x3, y3, color = 'crimson', linewidth = 3, label = 'South')
# Adding Plot Details
plt.title('Mean NO2 Level by Year')
plt.xlabel('Year')
plt.ylabel('NO2 Level (Parts per billion)')
plt.xticks(ticks = x, rotation = 45)
plt.legend();
# Box plot visualization
plt.figure(figsize = (20, 10))
plt.axes(facecolor = 'lightgrey')
sns.boxplot(data = grouped_region_yr, x = 'Region', y = 'NO2 Mean',
palette = ['darkorange', 'b', 'teal', 'crimson']);
sns.stripplot(data = grouped_region_yr, x = 'Region', y = 'NO2 Mean', color = 'white', jitter = .2, size = 10)
plt.ylabel('NO2 Level (Parts per billion)')
plt.title('Mean NO2 Level by Year', size = 20);
# ANOVA Test
# Define alpha
alpha = 0.05
'''
The one-way ANOVA tests the null hypothesis that two or more groups have the same population mean.
The test is applied to samples from two or more groups, possibly with differing sizes.
'''
# One-way Anova test
statistic, pvalue = stats.f_oneway(northeast["NO2 Mean"], midwest["NO2 Mean"], south["NO2 Mean"], west["NO2 Mean"])
# Result printout
print('p-value = ', round(pvalue, 3))
if pvalue < alpha:
    print('Reject null hypothesis.')
else:
    print('Fail to reject null hypothesis.')
# Two-Sample Student T-Test
# Define alpha
alpha = 0.05
'''
This is a two-sided test for the null hypothesis that two independent samples have identical average (expected) values.
By default the test assumes identical population variances; we pass equal_var=False to run Welch's t-test, which does not.
'''
region_pair= ['Northeast vs Midwest', 'Northeast vs West', 'Northeast vs South',
'Midwest vs South' , 'Midwest vs West' ,
'West vs South' ,
]
ttest_stat = []
ttest_pval = []
# Comparing Northeast vs Midwest
# Calculate Independent (Two Sample) T-Test
statistic, pvalue = stats.ttest_ind(northeast["NO2 Mean"], midwest["NO2 Mean"], equal_var=False)
ttest_stat.append(round(statistic, 2))
ttest_pval.append(round(pvalue, 2))
# Comparing Northeast vs West
# Calculate Independent (Two Sample) T-Test
statistic, pvalue = stats.ttest_ind(northeast["NO2 Mean"], west["NO2 Mean"], equal_var=False)
ttest_stat.append(round(statistic, 2))
ttest_pval.append(round(pvalue, 2))
# Comparing Northeast vs South
# Calculate Independent (Two Sample) T-Test
statistic, pvalue = stats.ttest_ind(northeast["NO2 Mean"], south["NO2 Mean"], equal_var=False)
ttest_stat.append(round(statistic, 2))
ttest_pval.append(round(pvalue, 2))
# Comparing Midwest vs South
# Calculate Independent (Two Sample) T-Test
statistic, pvalue = stats.ttest_ind(midwest["NO2 Mean"], south["NO2 Mean"], equal_var=False)
ttest_stat.append(round(statistic, 2))
ttest_pval.append(round(pvalue, 2))
# Comparing Midwest vs West
# Calculate Independent (Two Sample) T-Test
statistic, pvalue = stats.ttest_ind(midwest["NO2 Mean"], west["NO2 Mean"], equal_var=False)
ttest_stat.append(round(statistic, 2))
ttest_pval.append(round(pvalue, 2))
# Compare West vs South
# Calculate Independent (Two Sample) T-Test
statistic, pvalue = stats.ttest_ind(west["NO2 Mean"], south["NO2 Mean"], equal_var=False)
ttest_stat.append(round(statistic, 2))
ttest_pval.append(round(pvalue, 2))
region_pair_dict = {
'Region Pair' : region_pair,
'Statistics' : ttest_stat,
'P_value' : ttest_pval,
}
region_pair_df = pd.DataFrame(region_pair_dict)
region_pair_df = region_pair_df.sort_values('P_value')
region_pair_df = region_pair_df.reset_index( drop = True)
region_pair_df
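The six pairwise cells above can also be generated with a single loop over `itertools.combinations`. A sketch on synthetic regional samples (the values are made up; the notebook uses the real regional DataFrames):

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins for the four regional NO2 samples.
groups = {
    'Northeast': rng.normal(16, 1, 16),
    'Midwest': rng.normal(12, 1, 16),
    'South': rng.normal(11, 1, 16),
    'West': rng.normal(12, 1, 16),
}

# One Welch's t-test per unordered pair of regions.
rows = []
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    stat, p = stats.ttest_ind(a, b, equal_var=False)
    rows.append((f'{name_a} vs {name_b}', round(stat, 2), round(p, 4)))
```

The loop removes the six copy-pasted cells and guarantees every pair is covered exactly once.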
Visually, the South region appears different from the Northeast and West regions. The t-test results confirm that, at a significance level of 0.05, we can statistically reject the null hypothesis: the regions do not share the same average NO2 level.
# Pollution visualization on a heatmap
grouped_state_year = tidy_data.groupby(['State', 'year']).mean(numeric_only=True)
grouped_state_year = grouped_state_year.reset_index()
year = 2015
data_year = grouped_state_year.loc[grouped_state_year['year'] == year].copy()
state = grouped_state_year['State'].unique()
# Call and lookup each state's coordinates
lat_list = []
lng_list = []
for itm in state:
    # Build the endpoint URL
    target_url = ('https://maps.googleapis.com/maps/api/geocode/json?'
                  'address={0}&key={1}').format(itm + ', USA', gkey)
    # Run a request to the endpoint and convert the result to JSON
    geo_data = requests.get(target_url).json()
    # Find each state's geolocation
    lat = geo_data['results'][0]['geometry']['location']['lat']
    lng = geo_data['results'][0]['geometry']['location']['lng']
    # Save latitude and longitude into lists
    lat_list.append(lat)
    lng_list.append(lng)
state_coor_list = {
'State' : state,
'lat' : lat_list,
'lng' : lng_list,
}
state_coor = pd.DataFrame(state_coor_list)
state_coor.to_csv('state_coordinates.csv', index = False)
merge_coor_data = pd.merge(state_coor, data_year, on = 'State', how = 'left')
# Configure gmaps
gmaps.configure(api_key=gkey)
locations = merge_coor_data[['lat', 'lng']].copy()
weights = merge_coor_data['NO2 Mean']
# Plot Heatmap
fig = gmaps.figure(zoom_level = 3, center = (40, -75))
# Create heat layer
heat_layer = gmaps.heatmap_layer(locations, weights=weights,
dissipating = False, max_intensity=15,
point_radius=3)
# Add layer
fig.add_layer(heat_layer)
# Display figure
fig
Overall, we see a general decreasing trend in air pollution as well as regional differences. There are some limitations to this analysis: first, the data did not include observations for the states of Mississippi, Montana, Nebraska, and Vermont; second, the downward trend cannot continue past 0.
Works Cited
“Nitrogen Dioxide (NO2) Pollution.” EPA, Environmental Protection Agency, 13 June 2019, www.epa.gov/no2-pollution.
“Nitrogen Dioxide (NO2).” Department of Agriculture, Water and the Environment, www.environment.gov.au/protection/publications/factsheet-nitrogen-dioxide-no2.
“The Regions of the United States.” U.S. Embassy & Consulate in the Republic of Korea, https://kr.usembassy.gov/education-culture/infopedia-usa/travel-usa/regions-united-states/.