- Wed 06 March 2019
- datascience
- #folium, #data visualization, #austin
In this post, I explore patterns in traffic incidents in Travis County. These are the questions I set out to answer:
- Is traffic worse on weekends than on weekdays?
- Which hour is the worst for traffic?
- Which areas in Travis County are the worst for collisions?
- How do patterns in traffic change throughout the month?
Setup Imports¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium
import datetime as dt
import calendar as cal
data = pd.read_csv('Real-Time_Traffic_Incident_Reports.csv')
Part 1: Data Exploration¶
data.head()
Feature Engineering¶
I want to look at data only from 2018. To do this, I split the date out of the 'Published Date' column, convert it to a datetime, and filter out non-2018 entries:
data['Day'] = data['Published Date'].apply(lambda x: x.split()[0])
data['Day'].head()
data['Day'] = pd.to_datetime(data['Day'])
data = data[data['Day'].dt.year == 2018]
data.head()
Next, I create a column for the time of day for later analysis. I extract the time from 'Published Date' and then convert it to military time:
def extract_time(x):
    #'Published Date' entries look like '<date> HH:MM:SS AM/PM ...'
    clock, meridiem = x.split()[1:3]
    hour, minute, second = clock.split(':')
    #convert to military (24-hour) time
    if meridiem == 'AM' and hour == '12':
        hour = '00'                          #12:xx:xx AM becomes 00:xx:xx
    elif meridiem == 'PM' and hour != '12':
        hour = str(int(hour) + 12)           #1 PM through 11 PM gain 12 hours
    return hour + ':' + minute + ':' + second
data['Time'] = data['Published Date'].apply(extract_time)
data['Time'] = pd.to_datetime(data['Time']).dt.time
data['Time'].head()
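As an aside, pandas can parse 12-hour timestamps directly. Here is a minimal alternative sketch; the format string assumes 'Published Date' looks like '08/15/2018 07:30:00 PM' plus a trailing tag, which isn't confirmed above, and 'Time_check' is just an illustrative column name:
#rejoin the first three whitespace-separated tokens (date, clock, AM/PM) and parse
published = pd.to_datetime(data['Published Date'].str.split().str[:3].str.join(' '),
                           format='%m/%d/%Y %I:%M:%S %p')
data['Time_check'] = published.dt.time   #should match the 'Time' column above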
data.info()
Here, I look at the types of traffic incidents that occur:
unique, counts = np.unique(data['Issue Reported'], return_counts=True)
traffic_types = pd.DataFrame({'Traffic Types':unique,'Count':counts})
traffic_types = traffic_types.sort_values(by=['Count'], ascending = False)
traffic_types.reset_index(drop=True)
sns.set_style('darkgrid')
g = sns.factorplot('Issue Reported', data=data,kind='count', size=6, aspect=2)
g.set_xticklabels(rotation=45)
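The bars above come out in an arbitrary order. A small tweak, sketched here using the order parameter, sorts them by frequency:
#sort the category order by how often each issue appears
order = data['Issue Reported'].value_counts().index
g = sns.factorplot('Issue Reported', data=data, kind='count',
                   size=6, aspect=2, order=order)
g.set_xticklabels(rotation=45)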
Part 2: Data Cleaning¶
I have three concerns for data cleaning:
- Values that are 0
- Values that are missing or null
- Unnecessary columns
I'll go through cleaning each of these in turn.
print(data[data['Latitude']==0].count(),'\n------------------------')
print(data[data['Longitude']==0].count())
Latitude and longitude have the same number of zeros, and I will infer that they are 0 for the same entries (the quick check below confirms this). I'll drop any row whose coordinates fall well outside the normal range for Travis County:
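A quick sanity check, as a sketch, to confirm the zeros line up row-for-row:
#True only if Latitude == 0 exactly where Longitude == 0
((data['Latitude'] == 0) == (data['Longitude'] == 0)).all()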
#remove incorrect latitude and longitudes
data = data.drop(data[(data['Longitude'] > -90) |
(data['Longitude'] < -100) |
(data['Latitude'] > 40) |
(data['Latitude'] < 20)].index)
missing = data.isnull().sum().sort_values(ascending=False)
pct = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending=False)
#creates dataframe with missing and pct missing
miss_data = pd.concat([missing, pct], axis=1, keys=['Missing','Percent'])
#shows columns with missing values
miss_data[miss_data['Missing']>0]
Drop missing data:
#gets rid of null values for longitude and by extension latitude
data = data[pd.notnull(data['Longitude'])]
Drop irrelevant columns:
data.drop(['Traffic Report ID',
'Published Date',
'Location',
'Address',
'Status',
'Status Date'],axis=1, inplace=True)
data = data.reset_index(drop=True)
data.info()
#Austin Coordinates: 30.2672° N, 97.7431° W
ATX_COOR = [30.2672, -97.7431]
MAX_RECORDS = 1000
#create empty map zoomed in on Austin
map1 = folium.Map(location = ATX_COOR, zoom_start = 12)
#add marker for every record in data
for i in range(0, MAX_RECORDS):
    folium.Marker(location=[data.iloc[i]['Latitude'], data.iloc[i]['Longitude']],
                  popup=data.iloc[i]['Issue Reported']).add_to(map1)
display(map1)
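If you want to keep the interactive map outside the notebook, folium maps can be written to standalone HTML; the filename here is just an example:
#save the marker map as a self-contained HTML file
map1.save('traffic_markers.html')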
The map is a little crowded, but zooming in and panning around, the two main highways (Loop 1 and I-35) and downtown Austin appear to have the highest accident rates. This makes sense: these are the roads that carry the most traffic. The next plot clusters the traffic incidents; you can click on a cluster to further explore that area:
Cluster Map of traffic incidents in 2018¶
from folium.plugins import MarkerCluster
#create empty map zoomed in on Austin
map2 = folium.Map(location = ATX_COOR, zoom_start = 12)
mc = MarkerCluster()
#add marker for every record in data
for i in range(0, 1500):
    mc.add_child(folium.Marker(location=[data.iloc[i]['Latitude'], data.iloc[i]['Longitude']],
                               popup=data.iloc[i]['Issue Reported']))
map2.add_child(mc)
display(map2)
The first thing that jumped out was the huge number of crashes at the intersection of I-35 and US-183: 14 of the 1,500 accidents shown occurred at this one North Austin intersection. Other major clusters include the intersection of TX-71 and I-35, downtown Austin, and the South Austin intersection of East Riverside Dr and I-35.
Animated Heatmaps¶
I chose August for my analyses because it is commonly cited as the month with the highest crash rate. The graph below shows that this holds for Austin as well, with May coming in at nearly the same number of accidents.
Here is an interesting read on the most dangerous times to drive: https://www.bactrack.com/blogs/expert-center/35042821-the-most-dangerous-times-on-the-road
You can click on the slider on the bottom left to manually control the animated maps.
plt.figure(figsize=(12,6))
ax = sns.countplot(x=data['Day'].dt.month)
ax.set_xlabel('Month')
plt.show()
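To read exact numbers off the plot above, a quick sketch of the underlying monthly counts:
#incidents per month, ordered January (1) through December (12)
data['Day'].dt.month.value_counts().sort_index()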
Below, I create a temporary data frame holding all entries whose reported issue I deem an "accident", and then sort by date.
temp_df = data[data['Day'].dt.month == 8]
#issue types that I'm treating as accidents
accident_types = ['Crash Service', 'Crash Urgent', 'COLLISION',
                  'TRFC HAZD/ DEBRIS', 'COLLISION WITH INJURY',
                  'COLLISION/ LVNG SCN', 'COLLISION/PRIVATE PROPERTY',
                  'VEHICLE FIRE', 'AUTO/ PED', 'Traffic Fatality']
temp_df = temp_df[temp_df['Issue Reported'].isin(accident_types)]
temp_df = temp_df.sort_values(by='Day')
temp_df = temp_df.reset_index(drop=True)
temp_df.tail()
g = sns.factorplot('Issue Reported', data=temp_df,kind='count', size=6, aspect=2)
g.set_xticklabels(rotation=45)
Most accidents are reported as "Crash Urgent" followed by "Crash Service", but many of these labels may just be different codes for similar incidents. For this reason, analyzing the type of crash seems futile, so I will lump them together and analyze their patterns:
August Accidents by Day of Month¶
#keep only the columns the heatmap needs, on a copy so temp_df is untouched
heat_df = temp_df[['Latitude', 'Longitude', 'Day']].copy()
heat_df['Weight'] = heat_df['Day'].dt.day.astype(float)
heat_df = heat_df.dropna(subset=['Latitude', 'Longitude', 'Weight'])
from folium import plugins
from folium.plugins import HeatMap
ATX_COOR = [30.2672, -97.7431]
austin_heatmap = folium.Map(location = ATX_COOR,
tiles = 'Stamen Terrain',
zoom_start = 11)
heat_data = [[[row['Latitude'],row['Longitude']]
for index, row in heat_df[heat_df['Weight']==i].iterrows()]
for i in range(1,32)]
hm = plugins.HeatMapWithTime(data=heat_data,display_index=True,max_opacity=0.8)
hm.add_to(austin_heatmap)
austin_heatmap
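As an aside, the 31 filter passes above can be collapsed into a single groupby. Here is a sketch assuming the same heat_df; the empty lists keep days with no accidents aligned to the slider:
#group once by day-of-month instead of filtering 31 times
groups = heat_df.groupby('Weight')
heat_data = [groups.get_group(float(i))[['Latitude', 'Longitude']].values.tolist()
             if float(i) in groups.groups else []
             for i in range(1, 32)]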
It looks like the distribution of accidents from day to day is random, though some of the variation may come down to whether a day falls on a weekend. Some days have accidents concentrated mainly on I-35, while on others they are evenly dispersed throughout the county. A rough weekday/weekend split, sketched below, sets up the next map:
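As a quick check, a sketch of August's raw accident counts split into weekdays and weekend days (note there are more weekdays than weekend days, so raw counts favor weekdays):
#boolean mask: Saturday (5) or Sunday (6)
is_weekend = temp_df['Day'].dt.dayofweek >= 5
print('Weekday accidents:', int((~is_weekend).sum()))
print('Weekend accidents:', int(is_weekend.sum()))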
August Accidents by Day of Week¶
heat_df2 = temp_df
heat_df2 = heat_df2.dropna(axis = 0)
heat_df2['Weight'] = heat_df2['Day'].apply(lambda x: x.dayofweek).astype(float)
heat_df2 = heat_df2.dropna(axis=0, subset=['Latitude','Longitude','Weight'])
heat_df2.head()
austin_heatmap2 = folium.Map(location = ATX_COOR,
tiles = 'Stamen Terrain',
zoom_start = 11)
heat_data2 = [[[row['Latitude'],row['Longitude']]
for index, row in heat_df2[heat_df2['Weight']==i].iterrows()]
for i in range(7)]
hm = plugins.HeatMapWithTime(data=heat_data2,display_index=True,max_opacity=0.8)
hm.add_to(austin_heatmap2)
austin_heatmap2
This map shows accidents with 1 = Monday and 7 = Sunday (the plugin labels frames starting from 1, so pandas' 0-6 day-of-week codes display as 1-7). More accidents occur on weekdays, but strangely enough the spatial pattern barely changes on weekends. The distribution of accidents may depend more on factors not in this dataset, such as weather. Let's see how the time of day affects accidents:
August Accidents by Hour of Day¶
heat_df3 = temp_df
heat_df3 = heat_df3.dropna(axis = 0)
heat_df3['Weight'] = heat_df3['Time'].apply(lambda x: x.hour).astype(float)
heat_df3 = heat_df3.dropna(axis=0, subset=['Latitude','Longitude','Weight'])
ATX_COOR = [30.2672, -97.7431]
austin_heatmap3 = folium.Map(location = ATX_COOR,
tiles = 'Stamen Terrain',
zoom_start = 11)
heat_data3 = [[[row['Latitude'],row['Longitude']]
for index, row in heat_df3[heat_df3['Weight']==i].iterrows()]
for i in range(24)]
hm = plugins.HeatMapWithTime(data=heat_data3,index=[i for i in range(24)],display_index=True,max_opacity=0.8)
hm.add_to(austin_heatmap3)
austin_heatmap3
This graph shows accidents by hour in military time, starting with midnight-1am as "0". Here is what I can make out:
- The rate of accidents seems to stay fairly high until around 4am.
- The hours of 5 and 6 have very few accidents, which makes sense because few cars are on the road.
- Accidents in the hours of 7 and 8 are centrally located around inner Austin, where most people commute for work.
- There are almost no accidents until 12pm, when the count picks up and spreads throughout Austin.
- Accidents then increase steadily, peaking around 10pm. I expected more crashes during rush hour, but the count actually tops out as night begins; the static hourly counts sketched below back this up.
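To put numbers behind those observations, a sketch of a static hourly countplot using the same temp_df:
#accidents per hour of day (24-hour clock) in August
hours = temp_df['Time'].apply(lambda t: t.hour)
plt.figure(figsize=(12, 6))
ax = sns.countplot(x=hours)
ax.set_xlabel('Hour of day')
plt.show()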
Summary¶
This project has a lot of potential for pattern exploration that I did not get into. The code is reusable for any location, as long as you have coordinate points and your own city's center coordinates. There is also plenty of room to explore other questions: which zip codes have the most accidents, which areas are more prone to certain types of accidents, and how traffic by hour differs on weekdays vs. weekends, to name a few. I went through data exploration and data cleaning, then took on a more analytical role to uncover patterns in traffic without the help of machine learning. I hope you enjoyed the post, and feel free to reuse my code to uncover patterns in your own city! A minimal starting template is sketched below.
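To make the reuse concrete, here is a hypothetical template; the coordinates and bounding box are placeholder assumptions you would replace for your own city:
#example: re-center the analysis on another city (values here are assumptions)
CITY_COOR = [41.8781, -87.6298]      #e.g. downtown Chicago
LAT_MIN, LAT_MAX = 41.0, 43.0        #loose bounding box for coordinate cleaning
LON_MIN, LON_MAX = -89.0, -86.0
city_map = folium.Map(location=CITY_COOR, zoom_start=12)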