I’ve gotten interested in the #GamerGate “controversy” (I’m pretty completely persuaded that any talk about “ethics” is a façade for a lot of reactionary nonsense, as well as abundant harassment and misogyny), and it occurred to me that it represented an interesting data set to mine using Python. This is a quick guide to getting started, but it could be adapted to any effort to data-mine Twitter.
Setting Up to Connect to Twitter
First, you’re going to need to set up a Twitter app that you can use for authentication. You can do this at apps.twitter.com/app/new. You’ll need to have a valid Twitter account with an authenticated phone number.
Enter a name, description and web site URL for your application. You won’t need a callback URL.

Check “Yes, I agree” at the bottom of the Developer Agreement, and click the “Create your Twitter application” button.

Your application will be created. To use Tweepy to capture tweets, we’ll need the Consumer Key and Consumer Secret, and we’ll also need to set up an access token. Click on the “manage keys and access tokens” link next to your “Consumer Key (API Key)” in the “Application Settings” section.

This will take you to the “Keys and Access Tokens” tab. Note your “Consumer Key” and “Consumer Secret” (greyed out here).

In the “Your Access Token” section at the bottom of the page, click on “Create my access token”.

An “Access Token” and an “Access Token Secret” (again, greyed out here) will be generated; you’ll need these as well.

Install the Python Prerequisites
For this project, we’re going to need the Tweepy, pandas, and matplotlib libraries:
pip install tweepy pandas matplotlib
Here’s a simple-minded Python script using Tweepy to collect tweets mentioning “gamergate” from the Twitter streaming API:
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

access_token = "YOUR ACCESS TOKEN GOES HERE"
access_token_secret = "YOUR ACCESS TOKEN SECRET GOES HERE"
consumer_key = "YOUR CONSUMER KEY GOES HERE"
consumer_secret = "YOUR CONSUMER KEY SECRET GOES HERE"


class StdOutListener(StreamListener):
    # Print each raw message from the stream as it arrives
    def on_data(self, data):
        print(data)
        return True

    # Print the status code if the stream reports an error
    def on_error(self, status):
        print(status)


if __name__ == '__main__':
    listener = StdOutListener()
    auth_handler = OAuthHandler(consumer_key, consumer_secret)
    auth_handler.set_access_token(access_token, access_token_secret)
    stream = Stream(auth_handler, listener)
    stream.filter(track=['gamergate'])
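If you plan to share the script, you may not want your keys sitting in the source file. Here is a minimal sketch of one alternative, reading the four values from environment variables. The variable names and the load_credentials helper are my own invention for illustration, not anything Tweepy or Twitter requires.

```python
import os

# Hypothetical variable names; use whatever suits your environment.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    # Treat unset and empty variables the same way
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

With that in place, the script could call creds = load_credentials() and pass creds["TWITTER_CONSUMER_KEY"] and friends to OAuthHandler and set_access_token instead of the hard-coded strings.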
UPDATE
The script, as it stands, times out on a read every once in a while. A minor improvement is to wrap the collection in a while loop with a try and an except, so that a timeout restarts the stream instead of dropping you back to the shell prompt. Catching Exception, rather than using a bare except, means Ctrl-C will still stop the script:
while True:
    try:
        stream.filter(track=['gamergate'])
    except Exception:
        continue
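A loop that retries immediately can hammer the API if the failure is persistent. As a rough sketch of a gentler variant, the helper below waits a little longer after each failure and gives up after a fixed number of attempts. The run_with_retries name, its parameters, and the doubling delay are all my assumptions, not part of Tweepy.

```python
import time


def run_with_retries(collect, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call collect() until it returns, retrying on errors with doubling delays.

    Returns the number of failed attempts before success, and re-raises the
    last error once max_retries failures have occurred.
    """
    failures = 0
    while True:
        try:
            collect()
            return failures
        except Exception as error:  # KeyboardInterrupt is not caught here
            failures += 1
            if failures >= max_retries:
                raise
            delay = base_delay * 2 ** (failures - 1)
            print("collect() failed (%s); retrying in %.1fs" % (error, delay))
            sleep(delay)
```

In the script above, this would be called as run_with_retries(lambda: stream.filter(track=['gamergate'])). The sleep argument is only there so the waiting can be swapped out when testing.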
All this script does is print every tweet that Tweepy captures, in JSON format. If you run it, the output for a single tweet will look something like this (shown here as the Python dictionary the JSON parses to):
{u'contributors': None, u'truncated': False, u'text': u'RT @CommissarOfGG: Anti taking pride that nobody can tell the difference between them and someone pretending to be retarded.\n\n#GamerGate ht\u2026', 'retweet': True, u'in_reply_to_status_id': None, u'id': 584828601125773314, u'favorite_count': 0, u'source': u'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', u'retweeted': False, u'coordinates': None, u'timestamp_ms': u'1428268978755', u'entities': {u'symbols': [], u'media': [{u'source_status_id_str': u'584828243808661504', u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'source_status_id': 584828243808661504, u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [139, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}], u'hashtags': [{u'indices': [126, 136], u'text': u'GamerGate'}], u'user_mentions': [{u'id': 2729513808, u'indices': [3, 17], u'id_str': u'2729513808', u'screen_name': u'CommissarOfGG', u'name': u'Comrade Commissar'}], u'trends': [], u'urls': []}, u'in_reply_to_screen_name': None, u'id_str': u'584828601125773314', u'retweet_count': 0, u'in_reply_to_user_id': None, u'favorited': False, u'retweeted_status': {u'contributors': None, u'truncated': False, u'text': u'Anti taking pride that nobody can tell the difference between them and someone pretending to be retarded.\n\n#GamerGate http://t.co/CS3Kb2Bkcm', u'in_reply_to_status_id': None, u'id': 584828243808661504, u'favorite_count': 2, u'source': u'<a href="http://twitter.com" rel="nofollow">Twitter Web 
Client</a>', u'retweeted': False, u'coordinates': None, u'entities': {u'symbols': [], u'media': [{u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [118, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}], u'hashtags': [{u'indices': [107, 117], u'text': u'GamerGate'}], u'user_mentions': [], u'trends': [], u'urls': []}, u'in_reply_to_screen_name': None, u'id_str': u'584828243808661504', u'retweet_count': 4, u'in_reply_to_user_id': None, u'favorited': False, u'user': {u'follow_request_sent': None, u'profile_use_background_image': False, u'default_profile_image': False, u'id': 2729513808, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/533572934799876096/DYR05LI4_normal.png', u'profile_sidebar_fill_color': u'000000', u'profile_text_color': u'000000', u'followers_count': 2047, u'profile_sidebar_border_color': u'000000', u'id_str': u'2729513808', u'profile_background_color': u'000000', u'listed_count': 26, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'utc_offset': -14400, u'statuses_count': 4940, u'description': u'#GamerGate #OpSKYNET', u'friends_count': 1584, u'location': u'Moscow', u'profile_link_color': u'DD2E44', u'profile_image_url': u'http://pbs.twimg.com/profile_images/533572934799876096/DYR05LI4_normal.png', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2729513808/1407939361', 
u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'Comrade Commissar', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 1980, u'screen_name': u'CommissarOfGG', u'notifications': None, u'url': u'http://www.facebook.com/commissarofgamergate', u'created_at': u'Wed Aug 13 14:10:24 +0000 2014', u'contributors_enabled': False, u'time_zone': u'Eastern Time (US & Canada)', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Sun Apr 05 21:21:33 +0000 2015', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None, u'extended_entities': {u'media': [{u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [118, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}]}}, u'user': {u'follow_request_sent': None, u'profile_use_background_image': False, u'default_profile_image': False, u'id': 2784597626, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/532401111823822848/KSIxqiLe_normal.jpeg', u'profile_sidebar_fill_color': u'000000', u'profile_text_color': u'000000', u'followers_count': 986, u'profile_sidebar_border_color': u'000000', u'id_str': u'2784597626', u'profile_background_color': u'000000', u'listed_count': 35, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', 
u'utc_offset': -18000, u'statuses_count': 25217, u'description': u"I wasn't born with enough middle fingers for perpetually outraged hipster douchebags compensating for their mediocrity with shelves of participation trophies.", u'friends_count': 785, u'location': u'Parts Unknown', u'profile_link_color': u'4A913C', u'profile_image_url': u'http://pbs.twimg.com/profile_images/532401111823822848/KSIxqiLe_normal.jpeg', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2784597626/1425335831', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'Unnecessary Robness', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 15912, u'screen_name': u'aDouScheiBler', u'notifications': None, u'url': None, u'created_at': u'Mon Sep 01 19:36:40 +0000 2014', u'contributors_enabled': False, u'time_zone': u'Central Time (US & Canada)', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Sun Apr 05 21:22:58 +0000 2015', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None, u'extended_entities': {u'media': [{u'source_status_id_str': u'584828243808661504', u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'source_status_id': 584828243808661504, u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [139, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': 
u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}]}}
Save the script above as tweetminer.py and leave it running in a terminal for as long as you like. I left mine going for 42 hours and collected about 65,000 tweets in a text file of about 300 MB.
python tweetminer.py >> gamergate.txt
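Before sinking time into analysis, it can be worth checking what actually landed in the file. The sketch below counts lines that parse as tweets, lines that parse as something else (limit notices and the like), and lines that don't parse at all. The summarize_capture name is made up for this example, and it assumes one JSON message per line and that anything with a "user" key is a tweet.

```python
import json


def summarize_capture(path):
    """Count tweets, other JSON messages, and unparseable lines in a capture file."""
    counts = {"tweets": 0, "other": 0, "unparseable": 0}
    with open(path, "r") as capture:
        for line in capture:
            if not line.strip():
                continue  # skip blank lines
            try:
                message = json.loads(line)
            except ValueError:
                counts["unparseable"] += 1
                continue
            # Assumption: only real tweets carry a 'user' object
            if isinstance(message, dict) and "user" in message:
                counts["tweets"] += 1
            else:
                counts["other"] += 1
    return counts
```

Calling summarize_capture('gamergate.txt') returns a small dictionary of counts. A large "unparseable" number would suggest the capture isn't one JSON message per line after all.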
When you’ve collected your data, here’s some Python to set up a sample pandas DataFrame containing information of interest: who tweeted, how many days old their account is, how many followers they have, who it was a retweet of (if it was one) and to whom it was a reply (if it was one).
That should give you plenty of grist for analysis.
import json
import pandas as pd
import matplotlib.pyplot as plt
from time import gmtime, mktime, strptime

tweets_data_path = 'gamergate.txt'

# Parse one JSON message per line, skipping blank or malformed lines
tweets_data = []
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except ValueError:
            continue

#
# Clean out limit messages, etc. (build a new list rather than removing
# items from the list while iterating over it, which skips entries)
#
tweets_data = [tweet for tweet in tweets_data
               if isinstance(tweet, dict) and 'user' in tweet]
print(len(tweets_data))

#
# Pull the data we're interested in out of the Twitter data we captured
#
rows_list = []
now = mktime(gmtime())
for tweet in tweets_data:
    author = tweet['user']['screen_name']
    #
    # If it was a retweet, also get the author of the original tweet
    #
    rtauthor = ""
    if 'retweeted_status' in tweet:
        rtauthor = tweet['retweeted_status']['user']['screen_name']
    reply_to = tweet.get('in_reply_to_screen_name') or ""
    #
    # Account age in days: subtract first, then convert seconds to days
    #
    created = mktime(strptime(tweet['user']['created_at'], "%a %b %d %H:%M:%S +0000 %Y"))
    age = int((now - created) / (60*60*24))
    followers = tweet['user']['followers_count']
    rows_list.append({'author': author, 'retweet_of': rtauthor, 'reply_to': reply_to, 'age': age, 'followers': followers})

tweets = pd.DataFrame(rows_list)
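One detail about the age calculation: mktime interprets a time tuple as local time, while Twitter's created_at strings are in UTC. Because both "now" and the creation date go through mktime, the offsets mostly cancel out, but here is a sketch of a variant that stays in UTC throughout by using calendar.timegm. The account_age_days helper and its constants are names I've made up for illustration.

```python
import calendar
import time

TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"
SECONDS_PER_DAY = 60 * 60 * 24


def account_age_days(created_at, now=None):
    """Whole days between a Twitter 'created_at' string (UTC) and now.

    `now` is seconds since the epoch; it defaults to the current time.
    """
    # timegm treats the parsed time tuple as UTC, unlike mktime
    created = calendar.timegm(time.strptime(created_at, TWITTER_TIME_FORMAT))
    if now is None:
        now = time.time()
    return int((now - created) // SECONDS_PER_DAY)
```

In the loop above, this would replace the age line with age = account_age_days(tweet['user']['created_at']). Passing an explicit now lets you measure account age as of the moment a tweet was posted rather than the moment you run the analysis.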
The resulting DataFrame will look something like this—note that rows 0-4 are retweets, and row 6 is a reply; “age” is days since the Twitter ID was created:
age author followers reply_to retweet_of
0 137 Maskgamer64 428 CultOfVivian
1 231 Smackfacemcgee 1304 Daddy_Warpig
2 2240 LenFirewood 1658 RSG_VILLENA
3 171 8bitsofsound 650 CommissarOfGG
4 102 devilstwosome 9 atlasnodded
5 24 tophatdril 34
6 11 TheRalphRetart 63 Dr_Louse
7... ... ... ... ... ...
64531 65 4EverPlayer2 614 mombot
64532 143 EnwroughtDreams 222 thewtfmagazine
64533 1996 _icze4r 22689 dauthaz
64534 1581 __DavidFlanagan 8315 Spacekatgal
64535 872 jtdg_b8z 621 GamingAndPandas
64536 2238 hanytimeh 914 thewtfmagazine
At this point you could easily find out the most-retweeted IDs in the DataFrame, for example:
In [146]: tweets['retweet_of'].value_counts()
Out[146]:
17974
Sargon_of_Akkad 1574
ItalyGG 1516
TheRalphRetort 1064
Blaugast 910
mylittlepwnies3 899
thewtfmagazine 823
Nero 721
srhbutts 706
Daddy_Warpig 705
randomfox 627
atlasnodded 592
full_mcintosh 586
whenindoubtdo 584
ToKnowIsToBe 569
...
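The blank entry at the top of that output, with a count of 17974, is the empty string the script stores in retweet_of for tweets that weren't retweets. Here is a small sketch of how you might leave those out and reuse the same idea for other columns. The top_accounts helper and the tiny sample DataFrame are made up for illustration; only the column names come from the script above.

```python
import pandas as pd


def top_accounts(frame, column, n=10):
    """Most frequent non-empty values in `column`, most common first."""
    values = frame[column]
    return values[values != ""].value_counts().head(n)


# Tiny made-up sample with the same columns as the real DataFrame
sample = pd.DataFrame([
    {"author": "a", "retweet_of": "x", "reply_to": ""},
    {"author": "b", "retweet_of": "x", "reply_to": ""},
    {"author": "c", "retweet_of": "", "reply_to": "y"},
    {"author": "d", "retweet_of": "z", "reply_to": ""},
    {"author": "e", "retweet_of": "", "reply_to": ""},
])
print(top_accounts(sample, "retweet_of"))
```

On the real data, top_accounts(tweets, 'retweet_of') would give the most-retweeted accounts without the blank row, and top_accounts(tweets, 'reply_to') would do the same for the accounts most often replied to.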
Check out the follow-on posting to see how to use NetworkX and Gephi to make visualizations of the data.