Saving a Twitter Timeline to Pandas for Analysis

in #programming · 6 years ago

My application for a Twitter developer account was approved, and so I wrote my first program using the Twitter API today. It uses the twython library to retrieve a particular user's timeline and saves the timestamps, text, and like/retweet counts to a Pandas dataframe.

A few notes:

  • I hate that Twitter doesn't use ISO 8601 timestamps the way the Steem API does. ISO 8601 timestamps look like YYYY-MM-DDTHH:MM:SS, so you can sort and compare them as plain strings. Twitter doesn't use any of the other perfectly good standards either; its timestamps look like "Thu Apr 06 15:28:43 +0000 2017", so the entire first page of Google results for "Twitter API date format" is "how the heck do I parse this in my favorite programming language." The answer I took from StackOverflow uses the email date parser (email.utils.parsedate_tz).
  • I also hate that it seems standard these days for REST APIs to make overlap between pages impossible to avoid. You can query by a start point or an end point, but both are inclusive. Is there a good design reason I'm missing here?
  • The Twitter API docs are very clear that you're getting retweets whether or not you wanted them, so you'd better pass include_rts=1 so your code doesn't break at some future point when a hapless intern fixes the bug.
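
As a hedged alternative to the email parser, here is a minimal sketch that parses the same timestamp with datetime.strptime and converts it to the sortable ISO 8601 form. The format string and the helper name parse_twitter_timestamp are my own assumptions based on the sample timestamp above, not something taken from the Twitter docs, and %a and %b depend on the locale using English day and month names.

```python
from datetime import datetime, timezone

# Assumed layout, inferred from "Thu Apr 06 15:28:43 +0000 2017".
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_twitter_timestamp(ts):
    """Parse a created_at string into a timezone-aware UTC datetime."""
    return datetime.strptime(ts.strip(), TWITTER_FORMAT).astimezone(timezone.utc)

dt = parse_twitter_timestamp("Thu Apr 06 15:28:43 +0000 2017")
iso = dt.strftime("%Y-%m-%dT%H:%M:%S")
print(iso)  # 2017-04-06T15:28:43

# ISO 8601 strings in the same time zone sort chronologically as plain text.
ordered = sorted([iso, "2016-12-31T23:59:59"])
```

The result here is timezone-aware, unlike the naive datetime the script below produces, so the two should not be compared directly.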
#!/usr/bin/python3

from twython import Twython
import json
import pprint
import pandas
from datetime import datetime, timedelta
from email.utils import parsedate_tz

with open( "secret.json", "r" ) as f:
    secret = json.load( f )

if "access" in secret:
    twitter = Twython( secret['key'], access_token=secret['access'] )
else:
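    # No stored token: request a bearer token, print it so it can be saved
    # in secret.json, then reconnect so later requests are authenticated.
    twitter = Twython( secret['key'], secret['secret'], oauth_version=2 )
    access_token = twitter.obtain_access_token()
    print( "access_token", access_token )
    twitter = Twython( secret['key'], access_token=access_token )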

# Source: https://stackoverflow.com/questions/7703865/going-from-twitter-date-to-python-datetime-date
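def timestamp_to_datetime( ts ):
    # parsedate_tz returns a 10-tuple; its last element is the UTC offset in seconds.
    time_tuple = parsedate_tz( ts.strip() )
    dt = datetime( *time_tuple[:6] )
    # Subtract the offset to get a naive datetime in UTC.
    return dt - timedelta( seconds=time_tuple[-1] )
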
tweets = {}
lastTime = datetime.utcnow()  # naive UTC, to match timestamp_to_datetime()
endTime = lastTime - timedelta( days = 365 )
lastId = None
screen_name = "NextRoguelike"
keys = [ 'id', 'created_at', 'text', 'retweet_count', 'favorite_count' ]

while endTime < lastTime:
    # The API returns tweets newest first, and max_id is inclusive, so the
    # tweet with id lastId comes back again. Keying the dict by tweet id
    # drops the duplicate.
    if lastId is None:
        timeline = twitter.get_user_timeline( screen_name=screen_name, count=100,
                                              include_rts=1 )
    else:
        timeline = twitter.get_user_timeline( screen_name=screen_name, count=100,
                                              include_rts=1, max_id = lastId )

    print( len( timeline ), "responses" )
    
    # FIXME: won't work for some account that only tweeted once :)
    if len( timeline ) <= 1:
        break

    for t in timeline:
        lastId = t['id']
        lastTime = timestamp_to_datetime( t['created_at'] )
        tweets[ lastId ] = [ t[k] for k in keys ]

df = pandas.DataFrame.from_dict( tweets, orient = 'index', columns = keys )
df.to_pickle( screen_name + "-tweets.pkl" )
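
On the overlap complaint: because tweet IDs are integers, one way to avoid refetching the boundary tweet is to pass the lowest ID seen so far minus one as the next max_id, which, as far as I know, is what Twitter's own timeline guide suggested. The sketch below shows the idea against an in-memory list. fetch_page is a hypothetical stand-in for get_user_timeline, not a real Twython call, and the IDs are made up.

```python
def fetch_page(all_ids, count, max_id=None):
    """Hypothetical stand-in for the API: newest first, max_id inclusive."""
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return sorted(eligible, reverse=True)[:count]

def fetch_all(all_ids, count=3):
    """Page through every id without ever refetching the boundary item."""
    collected = []
    max_id = None
    while True:
        page = fetch_page(all_ids, count, max_id)
        if not page:
            break  # an empty page means the timeline is exhausted
        collected.extend(page)
        # Step just below the oldest id seen, so the next page starts after it.
        max_id = min(page) - 1
    return collected

ids = [101, 105, 109, 120, 133, 140, 150]  # made-up tweet ids
result = fetch_all(ids)
print(result)
```

With this stopping rule an account with a single tweet is handled too, since the loop ends on an empty page rather than on a page of length one.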

Hi Mark. Interesting account you have here!

I have a question on this:

"My application for a Twitter developer account was approved"

Did you think they might not approve it for some reason? I've thought about applying but never did, and if they do some sort of screening, it makes me anxious! Do I have to justify why I want the developer account and what I'm going to do with it?

Yes, Twitter has an application form to fill out that asks for a description of what you plan to do with the API. It took about 20 days for them to review and approve it.