Twitter Trading – Downloading Tweets Using Python (Part 2 of 2)

Previously, in Part 1, code was produced that uses tweetstream to download tweets in real time. This part of the series looks at how to store those tweets for future analysis.

The tweets will be saved into an SQLite database, since SQLite works with both Python and R. It is also much simpler than most other SQL databases, with no daemons or services to install. Good SQLite references are available for R (Sqlite with R) and Python (Sqlite Python Docs).

The database will contain a “sources” table, used to track which information source each tweet was downloaded from. For now it will contain just one entry: twitter. Collecting only real-time tweets means a long wait before the database is large enough for meaningful analysis, so a future article will download tweets from historical archives (which count as a different information source).

Setup the db:

-bash-3.2$ sqlite3 twitter.db
SQLite version 3.5.9
Enter ".help" for instructions
sqlite> CREATE TABLE sources(sourceid INTEGER, sourcedesc TEXT, PRIMARY KEY(sourceid));
sqlite> INSERT INTO sources(sourcedesc) VALUES ('twitter');
sqlite> SELECT * FROM sources;
1|twitter
sqlite> CREATE TABLE tweets(id INTEGER, sourceid INTEGER, username TEXT, tweet TEXT, timestamp TIMESTAMP, PRIMARY KEY(id));
sqlite> .exit

Now onto the code. You may want to alter the try/except blocks depending on how you want the script to respond to errors. On my server a cron job launches the script every 5 minutes, so if it dies it gets restarted.
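As a sketch, such a cron entry might look like the one below. All the paths and filenames here are assumptions for illustration; flock (from util-linux, if available on your system) stops a second copy from starting while one is still running:

```shell
# Run the downloader every 5 minutes; flock -n skips the launch
# if the previous instance is still alive (paths are examples only)
*/5 * * * * /usr/bin/flock -n /tmp/tweetdownload.lock /usr/bin/python /home/user/tweetdownload.py >> /home/user/tweetdownload.log 2>&1
```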

import sqlite3 as lite
import tweetstream
import csv
import sys
import time

twitterUsername = "USERNAME"
twitterPassword = "PASSWORD"

twitterWordFilter = [] #Define the list of words to filter on
wordListCsv = csv.reader(open('wordstofilter.csv', 'rb'))
for row in wordListCsv:
    #Add the 0th column of the current row to the list
    twitterWordFilter.append(row[0])
print "Filtering the following words: ", ', '.join(twitterWordFilter)

con = None
try:
    #Load the database file (or create it if not found).
    #If the isolation level isn't set, we would need to call
    #con.commit() after every execute to save the transaction
    con = lite.connect('twitter.db', isolation_level=None)
    cur = con.cursor() #Use the cursor for executing queries

    #Get the sourceid (will be useful when we use multiple data sources)
    cur.execute("SELECT sourceid FROM sources WHERE sourcedesc='twitter'")
    sourceid = cur.fetchone()[0] #Get the source id

    with tweetstream.FilterStream(twitterUsername, twitterPassword,
                                  track=twitterWordFilter) as stream:
        for tweet in stream:
            #Twitter's created_at is UTC; subtract time.timezone to
            #correct for mktime assuming the local timezone
            tweettimestamp = time.mktime(time.strptime(tweet['created_at'],
                                "%a %b %d %H:%M:%S +0000 %Y")) - time.timezone
            print stream.count, "(", stream.rate, "tweets/sec). ", \
                tweet['user']['screen_name'], ':', tweet['text'].encode('ascii', 'ignore')
            #print tweet #Use for raw output
            try:
                cur.execute("INSERT INTO tweets(sourceid,username,tweet,timestamp) VALUES(?,?,?,?)",
                            [sourceid, tweet['user']['screen_name'],
                             tweet['text'].encode('ascii', 'ignore'), tweettimestamp])
            except lite.Error:
                print "SQL Insert Error: Probably some encoding issue due to foreign languages"
except tweetstream.ConnectionError, e:
    print "Disconnected from twitter. Reason:", e.reason
except lite.Error, e:
    print "SQLite Error:", e
except:
    print "ERROR:", sys.exc_info()
finally:
    if con:
        con.close()
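The timestamp stored by the script is a Unix epoch in UTC. When reading tweets back for analysis, it can be converted to a readable string again. The sketch below uses calendar.timegm, which treats the parsed struct as UTC directly and so sidesteps the mktime/time.timezone adjustment used in the script (the created_at string here is a made-up sample in Twitter's format, and the snippet uses Python 3 syntax):

```python
import calendar
import time

# A made-up sample in the same format as Twitter's created_at field
created_at = "Mon Sep 24 03:35:21 +0000 2012"

# Parse the string; calendar.timegm interprets the struct as UTC
parsed = time.strptime(created_at, "%a %b %d %H:%M:%S +0000 %Y")
epoch = calendar.timegm(parsed)

# Convert the epoch back to a readable UTC string
utc_str = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(epoch))
print(epoch, utc_str)
```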

Don’t forget to update wordstofilter.csv to track the stocks that you’re interested in. American stocks respond well to the $TICKER format, since StockTwits has popularised that convention. From a quick play around, using tickers without a symbol in front mostly just matches random foreign tweets.
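For example, wordstofilter.csv might look like this: one term per row, with the script reading the first column of each row (the tickers below are just examples, not recommendations):

```
$GOOG
$AAPL
$MSFT
```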

Attached is the complete project so far including a db (TwitterDownload.Zip).
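Once the database starts filling up, pulling the tweets back out for analysis is straightforward SQL. The sketch below builds an in-memory copy of the tweets table with a few made-up rows so it runs standalone (Python 3 syntax); point sqlite3.connect at twitter.db instead to query the real data:

```python
import sqlite3

# ':memory:' plus fake rows keeps this sketch self-contained;
# swap in 'twitter.db' to run against the real database
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute("CREATE TABLE tweets(id INTEGER, sourceid INTEGER, "
            "username TEXT, tweet TEXT, timestamp TIMESTAMP, PRIMARY KEY(id))")

# Made-up rows purely for illustration
rows = [(1, 'traderA', '$AAPL looking strong', 1348457721),
        (1, 'traderB', '$GOOG breakout?', 1348457800),
        (1, 'traderA', '$MSFT flat today', 1348457900)]
cur.executemany("INSERT INTO tweets(sourceid,username,tweet,timestamp) "
                "VALUES(?,?,?,?)", rows)

# Most active users first - a typical starting point for analysis
cur.execute("SELECT username, COUNT(*) FROM tweets "
            "GROUP BY username ORDER BY COUNT(*) DESC")
counts = cur.fetchall()
print(counts)  # [('traderA', 2), ('traderB', 1)]
con.close()
```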
