Data-mining Twitter for GamerGate—Visualization

In the previous posting, I went over how to connect to Twitter’s streaming API using a connector app and the Tweepy Python library, as well as a quick overview of how to construct a Pandas dataframe from the tweets we’ve collected.

In this posting, we’ll extract the information we need to build a directed graph of who’s retweeting whom with NetworkX, and then visualize that graph in Gephi. Along the way we’ll keep track of each user’s account age in days and follower count so we can filter on those factors if we like.

First, if you don’t have NetworkX, install it with pip, and download and install Gephi.

Again, we’ll assume that our tweets are collected in a text file, “gamergate.txt”. Let’s pull the data out of the text file into a new data frame.

import json
import re
import pandas as pd
from time import gmtime, mktime, strptime

tweets_data_path = 'gamergate.txt'

tweets_data = []
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        try:
            tweets_data.append(json.loads(line))
        except ValueError:
            # Skip blank or truncated lines
            continue
#
# Clean out limit messages, etc. Build a new list rather than removing
# items while iterating, which would skip the entry after each removal.
#
tweets_data = [tweet for tweet in tweets_data
               if isinstance(tweet, dict) and 'user' in tweet and 'text' in tweet]

#
# See how many we wound up with
#
print(len(tweets_data))

#
# Pull the data we're interested in out of the Twitter data we captured
#
rows_list = []
now = mktime(gmtime())
for tweet in tweets_data:
    author = ""
    rtauthor = ""
    age = rtage = followers = rtfollowers = 0
#
# If it was a retweet, get both the original author and the retweeter, save the original author's
# follower count and age
#
    try:
        author = tweet['user']['screen_name']
        rtauthor = tweet['retweeted_status']['user']['screen_name']
        rtage = int(now - mktime(strptime(tweet['retweeted_status']['user']['created_at'], "%a %b %d %H:%M:%S +0000 %Y"))) // (60*60*24)
        rtfollowers = tweet['retweeted_status']['user']['followers_count']
    except:
#
# Otherwise, just get the original author
#
        try:
            author = tweet['user']['screen_name']
        except:
            continue
#
# If this was a reply, save the screen name being replied to
#
    reply_to = ""
    if tweet['in_reply_to_screen_name'] is not None:
        reply_to = tweet['in_reply_to_screen_name']
#
# Calculate the age, in whole days, of this Twitter ID
#
    age = int(now - mktime(strptime(tweet['user']['created_at'], "%a %b %d %H:%M:%S +0000 %Y"))) // (60*60*24)
#
# Grab this ID's follower count and the text of the tweet
#
    followers = tweet['user']['followers_count']
    text = tweet['text']
#
# Construct a row, add it to our list
#
    rows_list.append({'author': author, 'reply_to': reply_to, 'age': age, 'followers': followers, 'retweet_of': rtauthor, 'rtfollowers': rtfollowers, 'rtage': rtage, 'text': text})

#
# When we've processed all the tweets, build the DataFrame from the rows
# we've collected
#
tweets = pd.DataFrame(rows_list)
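
One caveat about the age calculation: mktime() treats its argument as local time, while Twitter's created_at timestamps are in UTC. Because the script applies mktime() to both "now" and the creation time, the offsets mostly cancel out. Here is a minimal sketch of a UTC-only alternative using calendar.timegm(). The helper name account_age_days and the constant TWITTER_TIME_FORMAT are my own and are not part of the script above.

```python
import calendar
import time

# Same format string the script uses for Twitter's created_at field.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"


def account_age_days(created_at, now=None):
    """Return the whole number of days between created_at and now.

    created_at is a Twitter timestamp string; now is seconds since the
    epoch in UTC (defaults to the current time).
    """
    if now is None:
        now = time.time()
    # timegm() interprets the parsed struct_time as UTC, unlike mktime().
    created = calendar.timegm(time.strptime(created_at, TWITTER_TIME_FORMAT))
    return int(now - created) // (60 * 60 * 24)


# Timestamps taken from the sample tweet: account creation and tweet time.
created = "Wed Aug 13 14:10:24 +0000 2014"
seen = calendar.timegm(
    time.strptime("Sun Apr 05 21:22:58 +0000 2015", TWITTER_TIME_FORMAT))
print(account_age_days(created, now=seen))
```

In the loop above, this could stand in for the two int(now - mktime(...)) expressions if you wanted to avoid the dependence on the local time zone.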

Here’s a script that will iterate through the dataframe, row by row, and construct a directed graph of who’s retweeting whom. Each directed edge represents the relationship “is retweeted by”: an edge from A to B means that A was retweeted by B, and the higher the weight of that edge, the more often B has retweeted A. Each node represents an individual ID on Twitter, and has attributes to track the number of followers and the age of the ID in days.

import networkx as nx

#
# Create a new directed graph
#
J = nx.DiGraph()
#
# Iterate through the rows of our dataframe
#
for index, row in tweets.iterrows():
#
# Gather the data out of the row
#
    this_user_id = row['author']
    author = row['retweet_of']
#
# Cast to plain ints; the GEXF writer may not accept NumPy integer types
#
    followers = int(row['followers'])
    age = int(row['age'])
    rtfollowers = int(row['rtfollowers'])
    rtage = int(row['rtage'])
#
# Is the sender of this tweet in our network? Attributes are passed as
# keywords, since newer NetworkX releases no longer accept attr_dict.
#
    if this_user_id not in J:
        J.add_node(this_user_id, followers=followers, age=age)
#
# If this is a retweet, is the original author a node?
#
    if author != "" and author not in J:
        J.add_node(author, followers=rtfollowers, age=rtage)
#
# If this is a retweet, add an edge between the two nodes.
#
    if author != "":
        if J.has_edge(author, this_user_id):
            J[author][this_user_id]['weight'] += 1
        else:
            J.add_weighted_edges_from([(author, this_user_id, 1.0)])

nx.write_gexf(J, 'ggrtages.gexf')
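
If you want to cross-check the edge weights independently of NetworkX, the same tally can be sketched with a collections.Counter keyed on (original author, retweeter) pairs. The function name retweet_edge_weights and the sample rows are made up for illustration; the dictionary keys match the rows we built for the dataframe.

```python
from collections import Counter


def retweet_edge_weights(rows):
    """Count retweets per (original_author, retweeter) pair.

    Each row is a dict with at least 'author' (who sent the tweet) and
    'retweet_of' (the original author, or "" if it was not a retweet).
    """
    weights = Counter()
    for row in rows:
        original = row.get('retweet_of', "")
        if original:
            # Same direction as the graph: original author -> retweeter.
            weights[(original, row['author'])] += 1
    return weights


# Hypothetical rows, shaped like the entries of rows_list.
sample_rows = [
    {'author': 'bob', 'retweet_of': 'alice'},
    {'author': 'bob', 'retweet_of': 'alice'},
    {'author': 'carol', 'retweet_of': 'alice'},
    {'author': 'alice', 'retweet_of': ''},
]
weights = retweet_edge_weights(sample_rows)
print(weights.most_common(1))
```

Passing rows_list from the earlier script to this function should give counts that match the 'weight' attribute on the corresponding edges of J.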

The last line of the script saves a GEXF file that we can then read into Gephi. Start Gephi up, and open the file; we called ours “ggrtages.gexf”.

You’ll get a dialog telling you how many nodes and edges there are in the graph, whether it’s directed or not, and other information, warnings, etc. Click “OK”.
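
To double-check the node and edge counts that the dialog reports, you can count the elements in the GEXF file yourself. This is a rough sketch using only the standard library. The function name count_gexf_elements and the tiny sample document are my own, and the sketch assumes a flat GEXF file with one node element per node and one edge element per edge.

```python
import io
import xml.etree.ElementTree as ET


def count_gexf_elements(source):
    """Return (node_count, edge_count) for a GEXF file name or file object."""
    nodes = edges = 0
    for _, elem in ET.iterparse(source):
        # Tags look like '{namespace}node'; keep only the local name so the
        # count does not depend on which GEXF version wrote the file.
        local_name = elem.tag.rsplit('}', 1)[-1]
        if local_name == 'node':
            nodes += 1
        elif local_name == 'edge':
            edges += 1
    return nodes, edges


# A hand-written miniature GEXF document, for illustration only.
SAMPLE_GEXF = b"""<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="alice" label="alice"/>
      <node id="bob" label="bob"/>
    </nodes>
    <edges>
      <edge id="0" source="alice" target="bob" weight="2.0"/>
    </edges>
  </graph>
</gexf>
"""
print(count_gexf_elements(io.BytesIO(SAMPLE_GEXF)))
```

Calling count_gexf_elements('ggrtages.gexf') on the file we wrote should agree with the numbers Gephi shows.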

Gephi will import the GEXF file. You can now look at the information it contains by clicking on the “Data Laboratory” button at the top.

Click on the “Overview” button to start working with the network. At first, it doesn’t look like anything, since we haven’t actually run a visualization on it. Before we do, we can use some of the node attributes to color nodes a darker blue based on their age.

We can use the “Ranking” settings to color our nodes. Click on the “Select attribute” popup, and choose “age”.

You can choose different color schemes, change the spline curve used to apply color, etc., from here as well.

Click on the “Apply” button to apply the ranking to the network. The nodes will now be colored rather than gray.

Now, we’re ready to run a visualization on our data. From the “Layout” section, let’s choose “ForceAtlas 2”; it’s fast and good at showing relationships in a network.

Press the “Run” button, and let it go for a bit. A network this size (about 10K nodes and 30K edges) settled down on my MacBook Pro in five minutes or less. When you feel it’s stabilized into something interesting, press the “Stop” button, and then click on the “Preview” button at the top.

The preview panel won’t show anything at first. Click the “Refresh” button.

Gephi will render your visualization. You can use the mouse to drag it around, and you can zoom in and out with a scroll-wheel or with the “+” and “-” buttons below.

Mining Twitter for #GamerGate: A How-To

I’ve gotten interested in the #GamerGate “controversy” (I’m pretty completely persuaded that any talk about “ethics” is a façade for a lot of reactionary nonsense, as well as abundant harassment and misogyny), and it occurred to me that it represented an interesting data set to mine using Python. This is a quick guide for how to get started, but it could be adapted to any effort to datamine Twitter.

Setting Up to Connect to Twitter

First, you’re going to need to set up a Twitter app that you can use for authentication. You can do this at apps.twitter.com/app/new. You’ll need to have a valid Twitter account with an authenticated phone number.

Enter a name, description and web site URL for your application. You won’t need a callback URL.

Check “Yes, I agree” at the bottom of the Developer Agreement, and click the “Create your Twitter application” button.

Your application will be created. To use Tweepy to capture tweets, we’ll need the Consumer Key and Consumer Secret, and we’ll also need to set up an access token. Click on the “manage keys and access tokens” link next to your “Consumer Key (API Key)” in the “Application Settings” section.

This will take you to the “Keys and Access Tokens” tab. Note your “Consumer Key” and “Consumer Secret” (greyed out here).

In the “Your Access Token” section at the bottom of the page, click on “Create my access token”.

An “Access Token” and an “Access Token Secret” (again, greyed out here) will be generated; you’ll need these as well.

Install the Python Prerequisites

For this project, we’re going to need the Tweepy, Pandas, and matplotlib libraries:

pip install tweepy pandas matplotlib

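Here’s a simple-minded Python script using Tweepy to collect tweets mentioning “gamergate” from the Twitter streaming API. It is written against the Tweepy 3.x interface; the StreamListener class it relies on was removed in Tweepy 4.0: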

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

access_token = "YOUR ACCESS TOKEN GOES HERE"
access_token_secret = "YOUR ACCESS TOKEN SECRET GOES HERE"
consumer_key ="YOUR CONSUMER KEY GOES HERE"
consumer_secret = "YOUR CONSUMER KEY SECRET GOES HERE"

class StdOutListener(StreamListener):

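    def on_data(self, data):
        # Each message arrives as one JSON string; echo it to stdout
        print(data)
        return True

    def on_error(self, status):
        # Print the HTTP status code reported by the streaming API
        print(status)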

if __name__ == '__main__':

    listener = StdOutListener()
    auth_handler = OAuthHandler(consumer_key, consumer_secret)
    auth_handler.set_access_token(access_token, access_token_secret)
    stream = Stream(auth_handler, listener)

    stream.filter(track=['gamergate'])

UPDATE

As it stands, the script times out on a read every once in a while and crashes back to the shell prompt. A minor improvement is to wrap the collection in a while loop with a try and an except, so that it reconnects instead:

    while True:
        try:
            stream.filter(track=['gamergate'])
        except Exception:
            # Reconnect on errors, but let Ctrl-C (KeyboardInterrupt) stop the script
            continue
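
A tight reconnect loop like this can hammer the API if the failure persists. As a possible refinement, here is a sketch of a retry helper that waits a little longer after each failure and eventually gives up. The name run_with_retries, its parameters, and the flaky_connect stand-in are all hypothetical and not part of Tweepy.

```python
import time


def run_with_retries(connect, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call connect() until it returns, doubling the wait after each failure.

    Re-raises the last exception once max_retries failures have been seen.
    The sleep function is injectable so the helper can be tested quickly.
    """
    attempt = 0
    while True:
        try:
            return connect()
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise
            # Waits of base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for a connection that fails twice and then succeeds.
calls = []


def flaky_connect():
    calls.append(1)
    if len(calls) < 3:
        raise IOError("read timed out")
    return "connected"


delays = []
result = run_with_retries(flaky_connect, sleep=delays.append)
print(result)
print(delays)
```

With the script above, the call would look something like run_with_retries(lambda: stream.filter(track=['gamergate'])). Note that the failure count here never resets, so a long-running collector would still stop after max_retries timeouts in total.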

All this script does is print out every tweet captured by Tweepy, in JSON format. If you run it, the output will look something like this; shown here is a single tweet after it has been parsed into a Python dictionary:

{u'contributors': None, u'truncated': False, u'text': u'RT @CommissarOfGG: Anti taking pride that nobody can tell the difference between them and someone pretending to be retarded.\n\n#GamerGate ht\u2026', 'retweet': True, u'in_reply_to_status_id': None, u'id': 584828601125773314, u'favorite_count': 0, u'source': u'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', u'retweeted': False, u'coordinates': None, u'timestamp_ms': u'1428268978755', u'entities': {u'symbols': [], u'media': [{u'source_status_id_str': u'584828243808661504', u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'source_status_id': 584828243808661504, u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [139, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}], u'hashtags': [{u'indices': [126, 136], u'text': u'GamerGate'}], u'user_mentions': [{u'id': 2729513808, u'indices': [3, 17], u'id_str': u'2729513808', u'screen_name': u'CommissarOfGG', u'name': u'Comrade Commissar'}], u'trends': [], u'urls': []}, u'in_reply_to_screen_name': None, u'id_str': u'584828601125773314', u'retweet_count': 0, u'in_reply_to_user_id': None, u'favorited': False, u'retweeted_status': {u'contributors': None, u'truncated': False, u'text': u'Anti taking pride that nobody can tell the difference between them and someone pretending to be retarded.\n\n#GamerGate http://t.co/CS3Kb2Bkcm', u'in_reply_to_status_id': None, u'id': 584828243808661504, u'favorite_count': 2, u'source': u'<a href="http://twitter.com" rel="nofollow">Twitter Web 
Client</a>', u'retweeted': False, u'coordinates': None, u'entities': {u'symbols': [], u'media': [{u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [118, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}], u'hashtags': [{u'indices': [107, 117], u'text': u'GamerGate'}], u'user_mentions': [], u'trends': [], u'urls': []}, u'in_reply_to_screen_name': None, u'id_str': u'584828243808661504', u'retweet_count': 4, u'in_reply_to_user_id': None, u'favorited': False, u'user': {u'follow_request_sent': None, u'profile_use_background_image': False, u'default_profile_image': False, u'id': 2729513808, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/533572934799876096/DYR05LI4_normal.png', u'profile_sidebar_fill_color': u'000000', u'profile_text_color': u'000000', u'followers_count': 2047, u'profile_sidebar_border_color': u'000000', u'id_str': u'2729513808', u'profile_background_color': u'000000', u'listed_count': 26, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'utc_offset': -14400, u'statuses_count': 4940, u'description': u'#GamerGate #OpSKYNET', u'friends_count': 1584, u'location': u'Moscow', u'profile_link_color': u'DD2E44', u'profile_image_url': u'http://pbs.twimg.com/profile_images/533572934799876096/DYR05LI4_normal.png', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2729513808/1407939361', 
u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'Comrade Commissar', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 1980, u'screen_name': u'CommissarOfGG', u'notifications': None, u'url': u'http://www.facebook.com/commissarofgamergate', u'created_at': u'Wed Aug 13 14:10:24 +0000 2014', u'contributors_enabled': False, u'time_zone': u'Eastern Time (US & Canada)', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Sun Apr 05 21:21:33 +0000 2015', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None, u'extended_entities': {u'media': [{u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [118, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}]}}, u'user': {u'follow_request_sent': None, u'profile_use_background_image': False, u'default_profile_image': False, u'id': 2784597626, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/532401111823822848/KSIxqiLe_normal.jpeg', u'profile_sidebar_fill_color': u'000000', u'profile_text_color': u'000000', u'followers_count': 986, u'profile_sidebar_border_color': u'000000', u'id_str': u'2784597626', u'profile_background_color': u'000000', u'listed_count': 35, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', 
u'utc_offset': -18000, u'statuses_count': 25217, u'description': u"I wasn't born with enough middle fingers for perpetually outraged hipster douchebags compensating for their mediocrity with shelves of participation trophies.", u'friends_count': 785, u'location': u'Parts Unknown', u'profile_link_color': u'4A913C', u'profile_image_url': u'http://pbs.twimg.com/profile_images/532401111823822848/KSIxqiLe_normal.jpeg', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2784597626/1425335831', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'Unnecessary Robness', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 15912, u'screen_name': u'aDouScheiBler', u'notifications': None, u'url': None, u'created_at': u'Mon Sep 01 19:36:40 +0000 2014', u'contributors_enabled': False, u'time_zone': u'Central Time (US & Canada)', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Sun Apr 05 21:22:58 +0000 2015', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None, u'extended_entities': {u'media': [{u'source_status_id_str': u'584828243808661504', u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'source_status_id': 584828243808661504, u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [139, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': 
u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}]}}

Set a terminal running the script above for as long as you like. I left mine going for 42 hours, and collected about 65000 tweets in a text file about 300MB long.

python tweetminer.py >> gamergate.txt
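
Before building anything on top of the capture file, it can be worth checking how many usable tweets it holds. The sketch below assumes the file has one JSON object per line, as the script prints them, and skips blank lines, truncated lines, and limit notices. The generator name iter_tweets and the sample lines are my own.

```python
import json


def iter_tweets(lines):
    """Yield parsed tweets from an iterable of JSON lines.

    Blank lines, lines that fail to parse, and messages that lack a
    'user' or 'text' key (such as limit notices) are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            # Truncated or otherwise malformed line
            continue
        if isinstance(record, dict) and 'user' in record and 'text' in record:
            yield record


# Made-up lines: one tweet, one blank, one limit notice, one truncated.
sample_lines = [
    '{"text": "hello", "user": {"screen_name": "alice"}}\n',
    '\n',
    '{"limit": {"track": 5}}\n',
    '{"text": "cut off mid-wri',
]
usable = list(iter_tweets(sample_lines))
print(len(usable))
```

Against the real capture, you could pass an open file instead of the sample list, for example open('gamergate.txt') inside a with block.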

When you’ve collected your data, here’s some Python to set up a sample pandas DataFrame containing information of interest: who tweeted, how many days old their account is, how many followers they have, who it was a retweet of (if it was one) and to whom it was a reply (if it was one).

That should give you plenty of grist for analysis.

import json
import pandas as pd
import matplotlib.pyplot as plt
from time import gmtime, mktime, strptime

tweets_data_path = 'gamergate.txt'

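tweets_data = []
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        try:
            tweets_data.append(json.loads(line))
        except ValueError:
            # Skip blank or truncated lines
            continue
#
# Clean out limit messages, etc. Build a new list rather than removing
# items while iterating, which would skip the entry after each removal.
#
tweets_data = [tweet for tweet in tweets_data
               if isinstance(tweet, dict) and 'user' in tweet]

print(len(tweets_data))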

#
# Pull the data we're interested in out of the Twitter data we captured
#
rows_list = []
now=mktime(gmtime())
for tweet in tweets_data:
    author = ""
    rtauthor = ""
#
# If it was a retweet, get both the original author and the retweeter
#
    try:
        author = tweet['user']['screen_name']
        rtauthor = tweet['retweeted_status']['user']['screen_name']
    except:
#
# Otherwise, just get the original author
#
        try:
            author = tweet['user']['screen_name']
        except:
            continue

    reply_to = ""
    if tweet['in_reply_to_screen_name'] is not None:
        reply_to = tweet['in_reply_to_screen_name']

#
# Age in days: take the difference in seconds first, then divide
#
    age = int(now - mktime(strptime(tweet['user']['created_at'], "%a %b %d %H:%M:%S +0000 %Y"))) // (60*60*24)
    followers = tweet['user']['followers_count']
    rows_list.append({'author': author, 'retweet_of': rtauthor, 'reply_to': reply_to, 'age': age, 'followers': followers})

tweets = pd.DataFrame(rows_list)

The resulting DataFrame will look something like this—note that rows 0-4 are retweets, and row 6 is a reply; “age” is days since the Twitter ID was created:

        age           author  followers     reply_to       retweet_of
0       137      Maskgamer64        428                  CultOfVivian
1       231   Smackfacemcgee       1304                  Daddy_Warpig
2      2240      LenFirewood       1658                   RSG_VILLENA
3       171     8bitsofsound        650                 CommissarOfGG
4       102    devilstwosome          9                   atlasnodded
5        24       tophatdril         34                              
6        11   TheRalphRetart         63     Dr_Louse                 
7...    ...              ...        ...          ...              ...
64531    65     4EverPlayer2        614                        mombot
64532   143  EnwroughtDreams        222                thewtfmagazine
64533  1996          _icze4r      22689                       dauthaz
64534  1581  __DavidFlanagan       8315                   Spacekatgal
64535   872         jtdg_b8z        621               GamingAndPandas
64536  2238        hanytimeh        914                thewtfmagazine

At this point you could easily find out the most-retweeted IDs in the DataFrame, for example:

In [146]: tweets['retweet_of'].value_counts()
Out[146]: 
                   17974
Sargon_of_Akkad     1574
ItalyGG             1516
TheRalphRetort      1064
Blaugast             910
mylittlepwnies3      899
thewtfmagazine       823
Nero                 721
srhbutts             706
Daddy_Warpig         705
randomfox            627
atlasnodded          592
full_mcintosh        586
whenindoubtdo        584
ToKnowIsToBe         569
...
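
Since rows_list is a plain list of dictionaries, similar tallies can also be made without pandas. Here is a small sketch that classifies each row as a retweet, a reply, or an original tweet, and counts who gets replied to most. The function name tweet_kind is hypothetical; the three sample rows are copied from the DataFrame listing above.

```python
from collections import Counter


def tweet_kind(row):
    """Classify a row as 'retweet', 'reply', or 'original'."""
    if row.get('retweet_of'):
        return 'retweet'
    if row.get('reply_to'):
        return 'reply'
    return 'original'


# Rows 0, 5, and 6 from the DataFrame shown earlier.
sample_rows = [
    {'author': 'Maskgamer64', 'retweet_of': 'CultOfVivian', 'reply_to': ''},
    {'author': 'tophatdril', 'retweet_of': '', 'reply_to': ''},
    {'author': 'TheRalphRetart', 'retweet_of': '', 'reply_to': 'Dr_Louse'},
]

kinds = Counter(tweet_kind(row) for row in sample_rows)
replied_to = Counter(row['reply_to'] for row in sample_rows if row['reply_to'])
print(kinds)
print(replied_to.most_common(1))
```

Running the same two Counter lines over rows_list would give the split for the whole capture.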

Check out the follow-on posting to see how to use NetworkX and Gephi to make visualizations of the data.

The Micro Python pyboard Arrived!

The version 1.0 pyboard that I ordered from the Micro Python project arrived in the mail today. It’s amazingly small.

The board supports a REPL shell, accessible via the same USB cable that provides power, and has a number of LEDs, timers, a user-assignable switch, and an accelerometer framework. I’ll be putting together a review pretty shortly, but there are a million things going on, suddenly.

Stay tuned.