- Wed 06 March 2019
- datascience
- #folium, #data visualization, #austin
In this post, I explore patterns in traffic incidents in Travis County. Here are the questions that I explore:
- Is traffic worse on weekends than on weekdays?
- Which hour is the worst for traffic?
- Which areas in Travis County are the worst for collisions?
- How do patterns in traffic change throughout the month?
Setup Imports¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium
import datetime as dt
import calendar as cal
data = pd.read_csv('Real-Time_Traffic_Incident_Reports.csv')
Part 1: Data Exploration¶
data.head()
Feature Engineering¶
I want to look only at data from 2018, so I first pull the date out of the 'Published Date' column, convert it to a datetime, and filter out the non-2018 entries:
data['Day'] = data['Published Date'].apply(lambda x: x.split()[0])
data['Day'].head()
data['Day'] = pd.to_datetime(data['Day'])
data = data[data['Day'].dt.year == 2018]
data.head()
I create a column for time of the day for later analysis. I extract the time from 'Published Date' and then convert to military time:
def extract_time(x):
    time = x.split()[1:3]
    #convert to military time
    if (time[1] == 'AM' and time[0].split(':')[0] == '12'):
        hour = '00'
        return hour + ':' + time[0].split(':')[1] + ':' + time[0].split(':')[2]
    if (time[1] == 'PM'):
        hour = int(time[0].split(':')[0])
        if hour != 12:
            hour += 12
        return str(hour) + ':' + time[0].split(':')[1] + ':' + time[0].split(':')[2]
    return time[0]
data['Time'] = data['Published Date'].apply(extract_time)
data['Time'] = pd.to_datetime(data['Time']).dt.time
data['Time'].head()
data.info()
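The date and time steps above could also be done in one pass by letting pandas parse the whole timestamp. This is only a sketch: the format string assumes 'Published Date' looks like `08/15/2018 01:30:00 PM`, which is an assumption based on how the strings are split above, and the sample rows are made up.

```python
import pandas as pd

# Made-up sample rows; the real column comes from the CSV.
sample = pd.DataFrame({'Published Date': ['08/15/2018 12:05:00 AM',
                                          '08/15/2018 01:30:00 PM',
                                          '12/31/2017 11:59:59 PM']})

# Assumed format: month/day/year, 12-hour clock, AM/PM marker.
stamp = pd.to_datetime(sample['Published Date'], format='%m/%d/%Y %I:%M:%S %p')

sample['Day'] = stamp.dt.normalize()   # midnight of the same date, still datetime64
sample['Time'] = stamp.dt.time         # datetime.time objects on a 24-hour clock
sample = sample[sample['Day'].dt.year == 2018]

print(sample[['Day', 'Time']])
```

The `%I` and `%p` directives handle the 12 AM and 12 PM edge cases that the hand-written conversion has to special-case.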
Here, I look at the types of traffic incidents that occur:
unique, counts = np.unique(data['Issue Reported'], return_counts=True)
traffic_types = pd.DataFrame({'Traffic Types':unique,'Count':counts})
traffic_types = traffic_types.sort_values(by=['Count'], ascending = False)
traffic_types.reset_index(drop=True)
sns.set_style('darkgrid')
#factorplot was renamed catplot (and size became height) in newer seaborn versions
g = sns.catplot(x='Issue Reported', data=data, kind='count', height=6, aspect=2)
g.set_xticklabels(rotation=45)
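The count table above can also be produced with pandas alone. A minimal sketch, using made-up labels in place of the real 'Issue Reported' column:

```python
import pandas as pd

# Made-up labels standing in for data['Issue Reported'].
issues = pd.Series(['Crash Urgent', 'Crash Service', 'Crash Urgent',
                    'Traffic Hazard', 'Crash Urgent', 'Crash Service'],
                   name='Issue Reported')

# value_counts() already sorts by count, largest first.
counts = issues.value_counts()
traffic_types = counts.rename_axis('Traffic Types').reset_index(name='Count')
print(traffic_types)
```

This gives the same two columns as the `np.unique` version without a separate sort step.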
Part 2: Data Cleaning¶
I have three concerns for data cleaning:
- Values that are 0
- Values that are missing or null
- Unnecessary columns
I'll go through cleaning each of these.
print(data[data['Latitude']==0].count(),'\n------------------------')
print(data[data['Longitude']==0].count())
Latitude and longitude have the same number of zeros, so I will infer that the zeros occur in the same entries. I'll simply drop any entry whose coordinates fall far outside the normal range for Travis County:
#remove incorrect latitude and longitudes
data = data.drop(data[(data['Longitude'] > -90) |
                      (data['Longitude'] < -100) | 
                      (data['Latitude'] > 40) |
                      (data['Latitude'] < 20)].index)
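The inference that the zero latitudes and zero longitudes sit on the same rows can be checked directly instead of assumed. A small sketch with made-up coordinates (`pts`, `same_rows` and `clean` are names chosen for this illustration only):

```python
import pandas as pd

# Made-up coordinates; rows 1 and 3 are the bad ones.
pts = pd.DataFrame({'Latitude':  [30.27, 0.0, 30.40, 30.30],
                    'Longitude': [-97.74, 0.0, -97.70, 0.0]})

zero_lat = pts['Latitude'] == 0
zero_lon = pts['Longitude'] == 0

# True only if every zero latitude sits on the same row as a zero longitude.
same_rows = bool((zero_lat == zero_lon).all())
print('zeros on identical rows:', same_rows)

# Same bounding box as above, written as a keep-mask instead of a drop.
# Unlike the drop, this mask also discards rows with missing coordinates.
in_box = pts['Latitude'].between(20, 40) & pts['Longitude'].between(-100, -90)
clean = pts[in_box]
print(clean)
```

In this sample the check comes back False because row 3 has a zero longitude only. The bounding-box filter removes such rows either way, so the cleaning step does not depend on the inference being true.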
missing = data.isnull().sum().sort_values(ascending=False)
pct = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending=False)
#creates dataframe with missing and pct missing
miss_data = pd.concat([missing, pct], axis=1, keys=['Missing','Percent'])
#shows columns with missing values
miss_data[miss_data['Missing']>0]
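The missing-value table divides the null count by the row count, which is the same as taking the mean of the null mask. A short sketch of that shortcut on a made-up frame:

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing latitude.
sample = pd.DataFrame({'Latitude': [30.27, np.nan, 30.40, 30.30],
                       'Issue Reported': ['Crash Urgent', 'COLLISION',
                                          'COLLISION', 'Crash Service']})

# The mean of a boolean mask is the fraction of True values.
summary = pd.DataFrame({'Missing': sample.isna().sum(),
                        'Percent': sample.isna().mean() * 100})
summary = summary[summary['Missing'] > 0].sort_values('Missing', ascending=False)
print(summary)
```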
Drop missing data:
#gets rid of null values for longitude and by extension latitude
data = data[pd.notnull(data['Longitude'])]
Drop irrelevant columns:
data.drop(['Traffic Report ID', 
           'Published Date',
           'Location',
           'Address',
           'Status',
           'Status Date'],axis=1, inplace=True)
data = data.reset_index(drop=True)
data.info()
#Austin Coordinates: 30.2672° N, 97.7431° W
ATX_COOR = [30.2672, -97.7431]
MAX_RECORDS = 1000
#create empty map zoomed in on Austin
map1 = folium.Map(location = ATX_COOR, zoom_start = 12)
#add marker for every record in data
for i in range(0,MAX_RECORDS):
    folium.Marker(location = [data.iloc[i]['Latitude'],data.iloc[i]['Longitude']], popup=data.iloc[i]['Issue Reported']).add_to(map1)
display(map1)
The map is a little crowded, but after zooming in and looking around, the two main highways (Loop 1 and I-35) and downtown Austin appear to have the most incidents. This makes sense because they are the roads that carry the most traffic. The next plot shows clusters of where traffic incidents occur. You can click on the clusters to further explore certain areas:
Cluster Map of traffic incidents in 2018¶
from folium.plugins import MarkerCluster
#create empty map zoomed in on Austin
map2 = folium.Map(location = ATX_COOR, zoom_start = 12)
mc = MarkerCluster()
#add marker for every record in data
for i in range(0,1500):
    mc.add_child(folium.Marker(location = [data.iloc[i]['Latitude'],data.iloc[i]['Longitude']],popup=data.iloc[i]['Issue Reported']))
map2.add_child(mc)
display(map2)
The first thing that jumped out was the large number of crashes at the intersection of I-35 and US-183 in North Austin, with 14 of the 1500 incidents shown occurring at this one intersection. Other major clusters include the intersection of TX-71 and I-35, downtown Austin, and the South Austin intersection of East Riverside Dr and I-35.
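A count like "14 of 1500" can also be computed rather than read off the map. The sketch below counts points within a radius of a center using the haversine formula. The radius and the sample points are hypothetical, and the center is the Austin coordinate used in this post, not the I-35/US-183 interchange.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def count_within(points, center, radius_km):
    """Number of (lat, lon) points within radius_km of center."""
    return sum(haversine_km(lat, lon, center[0], center[1]) <= radius_km
               for lat, lon in points)

ATX_COOR = [30.2672, -97.7431]
# Made-up points: two close to the center, one far to the north.
points = [(30.2672, -97.7431), (30.2680, -97.7440), (30.4000, -97.7000)]
near = count_within(points, ATX_COOR, radius_km=0.5)
print(near)
```

With the real data, `points` could be built from the 'Latitude' and 'Longitude' columns and the center set to the intersection of interest.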
Animated Heatmaps¶
I chose August for my analyses because it is commonly cited as the month with the highest crash rate. The graph below shows that this holds for Austin as well, with May having roughly as many accidents.
Here is an interesting read on the most dangerous times to drive: https://www.bactrack.com/blogs/expert-center/35042821-the-most-dangerous-times-on-the-road
You can click on the slider on the bottom left to manually control the animated maps.
plt.figure(figsize=(12,6))
ax = sns.countplot(x=data['Day'].dt.month)
ax.set_xlabel('Month')
plt.show()
Below, I create a temporary data frame that holds every August entry whose incident type I count as an "accident", sorted by day of the month.
#incident types that I treat as accidents
accident_types = ['Crash Service',
                  'Crash Urgent',
                  'COLLISION',
                  'TRFC HAZD/ DEBRIS',
                  'COLLISION WITH INJURY',
                  'COLLISION/ LVNG SCN',
                  'COLLISION/PRIVATE PROPERTY',
                  'VEHICLE FIRE',
                  'AUTO/ PED',
                  'Traffic Fatality']
#keep August entries of those types only
temp_df = data[(data['Day'].dt.month == 8) &
               (data['Issue Reported'].isin(accident_types))]
temp_df = temp_df.sort_values(by='Day').reset_index(drop=True)
temp_df.tail()
#factorplot was renamed catplot (and size became height) in newer seaborn versions
g = sns.catplot(x='Issue Reported', data=temp_df, kind='count', height=6, aspect=2)
g.set_xticklabels(rotation=45)
Most accidents are labeled "Crash Urgent", followed by "Crash Service", but many of these labels could describe similar accidents under different codes. For this reason, analyzing the type of crash seems pretty futile, so I will lump them together and analyze their patterns:
August Accidents by Day of Month¶
#copy the slice so that adding a column does not raise SettingWithCopyWarning
heat_df = temp_df[['Latitude', 'Longitude', 'Day']].copy()
heat_df['Weight'] = heat_df['Day'].dt.day.astype(float)
heat_df = heat_df.dropna(axis=0, subset=['Latitude','Longitude','Weight'])
from folium import plugins
from folium.plugins import HeatMap
ATX_COOR = [30.2672, -97.7431]
austin_heatmap = folium.Map(location = ATX_COOR, 
                            tiles = 'Stamen Terrain', 
                            zoom_start = 11)
heat_data = [[[row['Latitude'],row['Longitude']]
             for index, row in heat_df[heat_df['Weight']==i].iterrows()]
            for i in range(1,32)]
hm = plugins.HeatMapWithTime(data=heat_data,display_index=True,max_opacity=0.8)
hm.add_to(austin_heatmap)
austin_heatmap
The distribution of accidents looks random from day to day, although some of the variation may come from whether a day falls on a weekend. Some days have accidents concentrated mainly on I-35, while others have them evenly dispersed throughout the county.
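The animated maps in this post all need the same input: one list of [latitude, longitude] pairs per time step. A sketch of a reusable helper that builds those lists with a single `groupby` instead of filtering the frame once per step. The name `build_frames` and the sample rows are hypothetical.

```python
import pandas as pd

def build_frames(df, weight_col, steps):
    """Return one list of [lat, lon] pairs per value in steps.

    Steps with no matching rows get an empty list, so the number of
    frames always equals len(steps).
    """
    groups = {key: grp[['Latitude', 'Longitude']].values.tolist()
              for key, grp in df.groupby(weight_col)}
    return [groups.get(step, []) for step in steps]

# Made-up rows: two incidents on day 1, none on day 2, one on day 3.
sample = pd.DataFrame({'Latitude': [30.27, 30.30, 30.40],
                       'Longitude': [-97.74, -97.70, -97.75],
                       'Weight': [1.0, 1.0, 3.0]})
frames = build_frames(sample, 'Weight', range(1, 4))
print(frames)
```

The result has the same nested-list shape as `heat_data`, so the same helper could serve the day-of-month, day-of-week and hour-of-day maps by changing `steps`.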
August Accidents by Day of Week¶
#copy the slice so that adding a column does not raise SettingWithCopyWarning
heat_df2 = temp_df.dropna(axis = 0).copy()
#dayofweek: Monday = 0 ... Sunday = 6
heat_df2['Weight'] = heat_df2['Day'].dt.dayofweek.astype(float)
heat_df2 = heat_df2.dropna(axis=0, subset=['Latitude','Longitude','Weight'])
heat_df2.head()
austin_heatmap2 = folium.Map(location = ATX_COOR,
                            tiles = 'Stamen Terrain', 
                            zoom_start = 11)
heat_data2 = [[[row['Latitude'],row['Longitude']]
             for index, row in heat_df2[heat_df2['Weight']==i].iterrows()]
            for i in range(7)]
hm = plugins.HeatMapWithTime(data=heat_data2,display_index=True,max_opacity=0.8)
hm.add_to(austin_heatmap2)
austin_heatmap2
This shows accidents where 1 = Monday and 7 = Sunday. More accidents occur on weekdays, but strangely enough the spatial pattern does not change much on weekends. The distribution of accidents may depend more on factors that are not in the dataset, such as weather. Let's see how time of day affects accidents:
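The weekday versus weekend comparison above can also be put in numbers. Because there are five weekdays for every two weekend days, comparing raw totals would favor weekdays, so this sketch compares the average number of incidents per calendar day. The dates are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up incident dates in August 2018 (Aug 4 and 5 were a Saturday and Sunday).
days = pd.Series(pd.to_datetime(['2018-08-01', '2018-08-01', '2018-08-02',
                                 '2018-08-03', '2018-08-04', '2018-08-05',
                                 '2018-08-06', '2018-08-06', '2018-08-06']))

# Incidents per calendar date. Dates with no incidents do not appear here,
# so the averages are over days that had at least one incident.
per_day = days.value_counts()

# dayofweek: Monday = 0 ... Sunday = 6, so 5 and 6 are the weekend.
kind = np.where(per_day.index.dayofweek >= 5, 'weekend', 'weekday')
avg = per_day.groupby(kind).mean()
print(avg)
```

On the real data, `days` would be `temp_df['Day']`.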
August Accidents by Hour of Day¶
#copy the slice so that adding a column does not raise SettingWithCopyWarning
heat_df3 = temp_df.dropna(axis = 0).copy()
heat_df3['Weight'] = heat_df3['Time'].apply(lambda x: x.hour).astype(float)
heat_df3 = heat_df3.dropna(axis=0, subset=['Latitude','Longitude','Weight'])
ATX_COOR = [30.2672, -97.7431]
austin_heatmap3 = folium.Map(location = ATX_COOR,
                            tiles = 'Stamen Terrain', 
                            zoom_start = 11)
heat_data3 = [[[row['Latitude'],row['Longitude']]
             for index, row in heat_df3[heat_df3['Weight']==i].iterrows()]
            for i in range(24)]
hm = plugins.HeatMapWithTime(data=heat_data3,index=[i for i in range(24)],display_index=True,max_opacity=0.8)
hm.add_to(austin_heatmap3)
austin_heatmap3
This graph shows accidents by hour in military time starting with midnight-1am as "0". Here is what I can make out:
- The rate of accidents seems to stay high until around 4am.
- The hours of 5 and 6 seem to have very few accidents, which makes sense because very few cars are on the road.
- The accidents in the hours of 7 and 8 seem to be centered around inner Austin, where most people commute for their jobs.
- There are almost no accidents after that until hour 12, when they pick up and spread throughout Austin.
- Accidents then increase steadily until they peak around 10pm. I expected more crashes during rush hour, but it seems that the most accidents occur as night begins.
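The observations above come from watching the animation, so a plain tally per hour is a useful cross-check. A minimal sketch with made-up clock times in place of the 'Time' column:

```python
import datetime as dt
from collections import Counter

# Made-up clock times standing in for the 'Time' column.
times = [dt.time(22, 15), dt.time(7, 40), dt.time(22, 5),
         dt.time(17, 30), dt.time(22, 59)]

by_hour = Counter(t.hour for t in times)

# Fill in hours with no incidents so that all 24 slots are present.
hourly = [by_hour.get(h, 0) for h in range(24)]

# Hour with the most incidents (the earliest hour wins a tie).
peak_hour = max(range(24), key=lambda h: hourly[h])
print(peak_hour, hourly[peak_hour])
```

Plotting `hourly` as a bar chart would show whether the late-evening peak seen in the heatmap holds up in the counts.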
Summary¶
This project has a lot of potential for pattern exploration that I did not get into. My code can be reused for any location, as long as you have coordinates for the incidents and the center coordinates of your own city. There is also plenty of room to explore other questions, such as which zip codes have the most accidents, which areas are more likely to have certain types of accidents, and how traffic by hour differs between weekdays and weekends, to name a few. I went through data exploration and data cleaning, and then took on a more analytical role to uncover patterns in traffic without the help of machine learning. I hope you enjoyed the post, and feel free to reuse my code to uncover patterns in your own city!
