- Sun 07 March 2021
- datascience
- #network analysis, #topic modeling, #word cloud, #psychology
Project Purpose¶
This project was made with the intent of understanding network analysis and topic modeling and how these can be applied to real-world scenarios. My current course, social media analytics, has shown me the potential value of understanding group dynamics and this project serves to solidify the application of this valuable skill. Lastly, I would like to delve into the psychology of extremism to better understand the context behind the actions of these bad actors.
Network analysis is a set of integrated techniques to show relationships among actors and to analyze the social structures that are created from these relations. This provides a basis to better understand social phenomena by analyzing these relations.
Topic modeling is a type of statistical model for discovering topics that occur in documents. This technique is frequently used to discover hiddent semantic structures in text.
An important note, ISIL lost all territory in March 2019. While supporters still exist, they are no longer an organization with any leaders. Most of the Twitter users in this dataset are now banned from the platform.
About the Data¶
The data are 17,000 tweets from 100+ pro-ISIS profiles from across the world since the November 2015 Paris attacks. The original purpose of the dataset was to gain community insights from Kaggle users to develop counter-measures to extremists working abroad. The dataset has 8 columns:
- Name
- Twitter Username
- Description
- Location
- Number of followers at the time the tweet was downloaded
- Number of statuses by the user when the tweet was downloaded
- Date and timestamp of tweet
- The tweet itself
Content¶
- Exploratory data analysis
- Network analysis
- Topic modeling
- Insights into the psychology of extremism
Imports¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from collections import Counter
import networkx as nx
import plotly.graph_objects as go
import networkx as nx
import langid
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud
%matplotlib inline
plt.style.use('ggplot')
df = pd.read_csv('tweets.csv')
df.head()
1. Exploratory Data Analysis¶
df.info()
There are several thousand missing values for "description" and "location" columns, but otherwise the dataset is complete.
#counting tweets and retweets
retweets = []
originaltweets = []
for user, tweet in zip(df['username'], df['tweets']):
match = re.search(r'^\bRT\b', tweet)
if match == None:
originaltweets.append([user,tweet])
else:
retweets.append([user,tweet])
print('The number of original tweets is ',len(originaltweets),'and the number of retweets is ',len(retweets))
# language distribution
# identify languages
predicted_languages = [langid.classify(tweet) for tweet in df['tweets']]
lang_df = pd.DataFrame(predicted_languages, columns=['language','value'])
# show the top ten languages & their counts
print(lang_df['language'].value_counts().head(10))
# plot the counts for the top ten most commonly used languages
colors=sns.color_palette('hls', 10)
pd.Series(lang_df['language']).value_counts().head(10).plot(kind = "bar",
figsize=(12,9),
color=colors,
fontsize=14,
rot=45,
title = "Top 10 most common languages")
The languages are a large majority English. This is indicative of this dataset and may or may not be representative of the languages spoken on twitter by pro-ISIS members.
#search for mentioned users who are in tweets.csv
present = []
not_present = []
for record in originaltweets:
match = re.findall(r'@\w*', record[1])
if match != []:
for name in match:
if (name[1:] in df['username'].unique()) and (record[0] != name[1:]):
present.append([record[0], name[1:]])
elif record[0] != name[1:]:
not_present.append([record[0], name[1:]])
present = np.array(present)
not_present = np.array(not_present)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.set_title('Users present vs. not present in tweets.csv')
counts = ['Present', 'Not Present']
values = [len(np.unique(present[:,1])), len(np.unique(not_present[:,1]))]
ax.bar(counts, values)
The large majority of mentions are of users not present in the dataset. Since I am doing a network analysis, I am only concerned with the connections between others within the dataset.
top5_senders = Counter(present[:,0]).most_common(5)
top5_receivers = Counter(present[:,1]).most_common(5)
print('top 5 senders\n-------------\n', top5_senders, '\n')
print('top 5 receivers\n---------------\n', top5_receivers)
The top 5 mention receivers seem to be: RamiAlLolah, Nidalgazaui, MilkSheikh2, WarReporter1, and _IshfaqAhmad. Their descriptions may be helpful to gain context on these accounts:
for name, _ in top5_receivers:
print("Username: {} - {}\n".format(name,
df[df['username'] == name]['description'].dropna().unique()[0]))
3 of the 5 accounts position themselves as news sources while the other 2 seem to be individuals. Next, I can further this dive into users by connecting them via a network analysis.
2. Network Analysis¶
#network
graph=nx.Graph()
all_users = list(set(present[:,0]) | set(present[:,1]))
edges = {}
counter = Counter(map(tuple, present))
for (sender, receiver), count in counter.items():
if (receiver, sender) in edges.keys():
edges[(receiver, sender)] = edges[(receiver, sender)] + count
else:
edges[(sender, receiver)] = count
for (sender, receiver), count in edges.items():
graph.add_edge(sender, receiver, weight=count)
followers = {}
tweet_num = {}
for username in all_users:
followers[username] = df[df['username'] == username]['followers'].unique()[-1]
tweet_num[username] = df[df['username'] == username]['tweets'].count()
sizes = [(followers[n] / tweet_num[n]) * 50 for n in graph.nodes()]
weights = [graph[u][v]['weight']/2 for u, v in graph.edges()]
plt.figure(figsize=(12,12))
nx.draw(graph, pos=nx.spring_layout(graph), with_labels=True, width=weights)
plt.show()
import pandas as pd
degree = nx.degree_centrality(graph) # calculate dgree
betweenness = nx.betweenness_centrality(graph) # calculate betweenness
closeness = nx.closeness_centrality(graph) # Calculate closeness
centrality = pd.DataFrame([degree, betweenness, closeness]).T
centrality.reset_index(inplace = True)
centrality.columns = ['Username', 'Degree', 'Betweenness', 'Closeness']
centrality.head()
Due to the large amount of nodes in the network, the graph is somewhat difficult to interpret. To do so, I have collected 3 metrics of centrality from the network:
- Degree: Assigns importance based simply on the number of links by a node to other nodes. This is useful to identifying the most popular/connected users.
- Betweenness: Measures the number of times a node lies on the shortest path between other nodes. This can be useful to identify those who connect disparate parts of the network. A user with high betweenness is important for transmitting new info, ideas, and opportunities to a wide audience.
- Closeness: Measures how close a node is to all other nodes. This can help identify which users can reach the whole network the fastest.
#top 5 degree centralities
centrality.sort_values(by=['Degree'],ascending=False)[:5]
#top 5 betweenness centralities
centrality.sort_values(by=['Betweenness'],ascending=False)[:5]
#top 5 closeness centralities
centrality.sort_values(by=['Closeness'],ascending=False)[:5]
MaghrabiArabi, RamiAlLolah, Uncle_SamCoCo, Nidalgazaui, and WarReporter1 are the top 5, respectively, for all 3 measures of centrality. This is strong evidence that these 5 actors play a significant role in the pro-ISIS twitter network.
4. Topic Modeling¶
#word cloud
junk = re.compile("al|RT|\n|&.*?;|http[s](?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)*")
tweets = [junk.sub(" ", t) for t in df.tweets]
vec = TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_df=.5)
tfv = vec.fit_transform(tweets)
terms = vec.get_feature_names()
wc = WordCloud(height=1000, width=1000, max_words=1000).generate(" ".join(terms))
plt.figure(figsize=(10, 10))
plt.imshow(wc)
plt.axis("off")
plt.show()
I first ran a word cloud to gain some context as to the content users are talking about. From a glance,it seems that users talk the most about war (e.g., "killed" and "attack"), religion (e.g., "muslim" and "allah"), and politics (e.g., "syria" and "assad"). Topic modeling can better help determine the different categories that users are talking about.
from sklearn.decomposition import NMF
nmf = NMF(n_components=8).fit(tfv)
for idx, topic in enumerate(nmf.components_):
print("Topic #{}:".format(str(idx+1)))
print(" ".join([terms[i] for i in topic.argsort()[:-10 - 1:-1]]))
print("")
Playing around with the number of topics, the optimal number seems to be around 8 topics. These topics are loosely:
- Countries involved
- Islamic state
- ISIS leaders
- Fights and skirmishes
- Rebels
- Twitter users
- Islamic state (again)
- Religion
from matplotlib import style
style.use('bmh')
df['topic'] = np.argmax(nmf.transform(vec.transform(tweets)), axis=1)
top5_users = df[df.username.isin(['MaghrabiArabi',
'RamiAlLolah',
'Uncle_SamCoco',
'Nidalgazaui',
'WarReporter1'])]
pd.crosstab(top5_users.username, top5_users.topic).plot.bar(stacked=True, figsize=(16, 10), colormap="coolwarm")
Combining my network analysis and topic modeling, I can create further insights. I can look into the topics that the top 5 most important players I found from the network analysis and analyze the topics that they tend to talk about. Here, we see some interesting trends:
- Uncle_SamCoCo and MaghrabiArabi talk about religion a large majority of the time
- RamiAlLolah talks mostly about countries' involvement. This makes sense as his description includes: "Forecasted many Israeli strikes in Syria/Lebanon."
- WarReporter1 and Nidalgazaui cover a larger spectrum of topics
4. Insights into Psychology of Extremism¶
The purpose of this project was to gain insights into the behavior of Pro-ISIS users on Twitter, but data can only go so far. To gain context on these insights, psychology is needed to understand why these bad actors behave this way in the first place.
Firstly, what is extremism? There is debate as to how to define extremism, but I prefer the definition that states that it is possessing beliefs that in some way increase one's propensity for violoence against other groups and that this behavior occurs within a culture where such violence is not tolerated or expected. This is helpful to isolate the definition of extremism from any particular set of political or religious beliefs.
In a published paper by Andreas Beelmann, "A Social-Developmental Model of Radicalization: A Systematic Integration of Existing Theories and Empirical Research," Beelmann argues that radicalization occurs in 3 steps:
- Ontogenetic development processes
- Proximal radicalization processes
- Extremist attitudes/opinions and behavior/action
In other words, maturity from childhood to adulthood, being near similar actors, and eventually taking on these behaviors. During maturity, Beelmann lists 3 distinct risk factors:
- Societal risk factors (e.g., real intergroup conflicts, intergroup threats, and the prevalence of ideologies legitimizing violence)
- Social risk factors (e.g., violence in the home, the experience of group discrimination, minimal social diversity)
- Individual risk factors (e.g., personality characteristics that favor domination/authoritarianism, self-esteem problems, antisocial behavior)
So, what can we do about these factors? There is no easy solution, but research shows that it is somewhat clear that developmental conditions combined with a lack of protective factors serve as a recipe for individuals to transition into extremism. Immediate action from data insights will help tremendously to detect and stifle extremist activity. However, long-term solutions will require a combination of data, policies, and support geared towards those individuals and communities susceptible to extremism if we are to most effectively tackle the root cause of this problem.