Previously, in Part 1, code was produced to use tweetstream and download tweets in real time. This part of the series looks at how to store those tweets for future analysis.
The tweets will be saved into an SQLite database (http://www.sqlite.org/download.html), since SQLite is compatible with both Python and R. It's also much simpler than most SQL servers (no daemons or services to install). Good SQLite information is available for R (Sqlite with R) and Python (Sqlite Python Docs).
The database will contain a “sources” table. This is used to track which information source the tweets were downloaded from. For now it'll contain just one entry: twitter. Collecting only real-time tweets means we'll have to wait a long time to build a database large enough to analyse. In a future article tweets will be downloaded from historical archives (hence a different information source).
Set up the database:
-bash-3.2$ sqlite3 twitter.db
SQLite version 3.5.9
Enter ".help" for instructions
sqlite> CREATE TABLE sources(sourceid INTEGER, sourcedesc TEXT, PRIMARY KEY(sourceid));
sqlite> INSERT INTO sources(sourcedesc) VALUES ("twitter");
sqlite> SELECT * FROM sources;
1|twitter
sqlite> CREATE TABLE tweets(id INTEGER, sourceid INTEGER, username TEXT, tweet TEXT, timestamp TIMESTAMP, PRIMARY KEY(id));
sqlite> .exit
-bash-3.2$
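If you'd rather create the same schema from Python instead of the sqlite3 shell, a minimal sketch is below. It mirrors the session above (same table names and columns); the use of CREATE TABLE IF NOT EXISTS and the duplicate check are additions so it can be re-run safely.

```python
import sqlite3

# Create (or open) the database and build the same schema as the shell session
con = sqlite3.connect('twitter.db')
cur = con.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS sources("
            "sourceid INTEGER, sourcedesc TEXT, PRIMARY KEY(sourceid))")
cur.execute("CREATE TABLE IF NOT EXISTS tweets("
            "id INTEGER, sourceid INTEGER, username TEXT, tweet TEXT, "
            "timestamp TIMESTAMP, PRIMARY KEY(id))")
# Only insert the twitter source if it isn't already there
cur.execute("SELECT COUNT(*) FROM sources WHERE sourcedesc='twitter'")
if cur.fetchone()[0] == 0:
    cur.execute("INSERT INTO sources(sourcedesc) VALUES ('twitter')")
con.commit()
con.close()
```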
Now onto the code. You may want to alter the try/except blocks depending on how you want the script to respond to errors. On my server a cron job launches the script every 5 minutes, so if it dies it gets restarted.
import sqlite3 as lite
import tweetstream
import csv
import sys
import time

twitterUsername = "USERNAME"
twitterPassword = "PASSWORD"

twitterWordFilter = [] #Define the list of words to filter on
wordListCsv = csv.reader(open('wordstofilter.csv', 'rb'))
for row in wordListCsv:
    #Add the 0th column of the current row to the list
    twitterWordFilter.append(row[0])
print "Filtering the following words: ", ', '.join(twitterWordFilter)

con = None
try:
    #Load the database file (or create it if not found).
    #If isolation_level isn't set to None we would need to call
    #con.commit() after every execute to save the transaction
    con = lite.connect('twitter.db', isolation_level=None)
    cur = con.cursor() #Use cursor for executing queries

    #Get the sourceid (will be useful when we use multiple data sources)
    cur.execute("SELECT sourceid FROM sources WHERE sourcedesc='twitter'")
    sourceid = cur.fetchone()[0] #Extract the source id from the returned row

    with tweetstream.FilterStream(twitterUsername, twitterPassword,
                                  track=twitterWordFilter) as stream:
        for tweet in stream:
            #Convert the created_at string to seconds since the epoch (UTC)
            tweettimestamp = time.mktime(time.strptime(tweet['created_at'],
                "%a %b %d %H:%M:%S +0000 %Y")) - time.timezone
            print stream.count, "(", stream.rate, "tweets/sec). ", \
                tweet['user']['screen_name'], ':', tweet['text'].encode('ascii', 'ignore')
            #print tweet #Use for raw output
            try:
                cur.execute("INSERT INTO tweets(sourceid,username,tweet,timestamp) VALUES(?,?,?,?)",
                            [sourceid, tweet['user']['screen_name'],
                             tweet['text'].encode('ascii', 'ignore'), tweettimestamp])
            except:
                print "SQL Insert Error: Probably an encoding issue due to foreign languages"
except tweetstream.ConnectionError, e:
    print "Disconnected from twitter. Reason:", e.reason
except lite.Error, e:
    print "SQLite Error:", e
except:
    print "ERROR:", sys.exc_info()
finally:
    if con:
        con.close()
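Once the script has been collecting for a while, the stored tweets can be read back with plain SQL. The sketch below is one illustrative query, not part of the download script: a hypothetical helper (the name top_users is my own) that lists the busiest usernames in the database.

```python
import sqlite3

def top_users(dbfile, limit=10):
    """Return (username, tweet count) pairs, busiest users first."""
    con = sqlite3.connect(dbfile)
    cur = con.cursor()
    # Aggregate the tweets table built by the download script
    cur.execute("SELECT username, COUNT(*) FROM tweets "
                "GROUP BY username ORDER BY COUNT(*) DESC LIMIT ?", (limit,))
    rows = cur.fetchall()
    con.close()
    return rows
```

The timestamp column holds seconds since the epoch, so values can be turned back into readable dates with time.gmtime and time.strftime.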
Don’t forget to update wordstofilter.csv to track the stocks that you’re interested in. American stocks seem to respond well to the $TICKER form, because StockTwits has encouraged that format. From a quick play around, using tickers without a symbol in front just matches random foreign tweets.
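For reference, the filter file is a plain CSV with one term per row; the script only reads column 0. A minimal sketch of writing and reading it (the $GOOG and $AAPL tickers are just illustrative choices):

```python
import csv

# Write an example wordstofilter.csv -- one filter term per row, column 0
with open('wordstofilter.csv', 'w') as f:
    f.write('$GOOG\n')
    f.write('$AAPL\n')

# Read it back the same way the download script does
twitterWordFilter = []
for row in csv.reader(open('wordstofilter.csv')):
    twitterWordFilter.append(row[0])
print(', '.join(twitterWordFilter))
```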
Attached is the complete project so far, including a database (TwitterDownload.Zip).