
major research progress updates

posted Dec 26, 2014, 11:40 AM by Tarek Hoteit   [ updated Dec 26, 2014, 11:50 AM ]
If you recall, the major goal of the research is to determine the relationship between sentiments on Twitter and corporate financial distress. The hypothesis questions are included in the dissertation proposal, but all of them require the data to be available, and gathering that data is what I have been working on over the last month. Here is a brief summary:

0) First, I set up my own Linux server with a MySQL database, the Django web framework, a Git repository, iPython/Python, and everything else needed for the work. The dedicated server is running on Linode in their Dallas location.

1) To determine the corporate firms that will be used in the analysis, I initially thought of the S&P 500 companies, but I realized that such firms may be too big to plausibly fall under distress. Hence, I decided to target all publicly held firms as listed on the S&P 500, Nasdaq, and NYSE. I first contacted the SEC for a list of publicly held firms, but they gave me a huge list of individuals and corporations without confirming which firms were publicly held. In the end, I found the list on the NASDAQ website and exported the list of firms in CSV format. I have approximately 6,000 firms.
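
For reference, a minimal sketch of how such an exported list might be loaded in Python; the file name companylist.csv and the "Symbol"/"Name" column headers are assumptions about the CSV export, not the actual schema:

```python
import csv

def load_symbols(path="companylist.csv"):
    """Read the exported company list and return (symbol, name) pairs.
    The file name and column headers are assumptions and may need adjusting."""
    firms = []
    with open(path) as f:
        for row in csv.DictReader(f):
            symbol = row.get("Symbol", "").strip()
            name = row.get("Name", "").strip()
            if symbol:
                firms.append((symbol, name))
    return firms

if __name__ == "__main__":
    firms = load_symbols()
    print("Loaded %d firms" % len(firms))
```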

2) To determine the financials for each firm, I found that Yahoo has a repository that returns JSON and supports database-like querying, called YQL, which can be used to retrieve each firm's financials (balance sheet, income statement, and cash flow) through hosted apps that interface with SEC filings. I wrote a Python script that takes each of the 6,000 firms, extracts its financials for the fourth quarter of 2014, and imports them into the database. I chose the financial variables according to the Altman Z-Score for financial distress analysis, but I can easily modify the script to support other financial ratios. So now I have fourth-quarter financials for all publicly held firms, which I will be using later.
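
Since the variables were chosen around the Altman Z-Score, here is a minimal sketch of the classic formula for publicly held manufacturing firms; the parameter names are just placeholders for whatever columns the database ends up using:

```python
def altman_z_score(working_capital, retained_earnings, ebit,
                   market_value_equity, sales,
                   total_assets, total_liabilities):
    """Classic Altman Z-Score for publicly held manufacturing firms.
    Z below roughly 1.81 is usually read as the distress zone,
    above roughly 2.99 as the safe zone."""
    x1 = working_capital / total_assets
    x2 = retained_earnings / total_assets
    x3 = ebit / total_assets
    x4 = market_value_equity / total_liabilities
    x5 = sales / total_assets
    return 1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5
```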

3) To extract sentiments from Twitter, I first needed a program that could interface with Twitter and extract the data I need. My initial thought was to pull all data since 2008 for each firm that reported bankruptcy, as listed in the Bankruptcy Research Database. The problem, however, was that Twitter only allows searching roughly the last week of its archive. In addition, it enforces limits on how much you can extract unless you pay tens of thousands of dollars to the vendors contracted to sell access to the Twitter firehose. Not wanting to do that, I decided to write code that runs endlessly and monitors the live Twitter stream for data about the corporate firms I want. The approach worked at first when I searched for companies like Amazon or Google, but I could not get results when I searched for companies like EDS. It turns out that the Twitter API does not allow exhaustive or partial-word search. Moreover, when you search for a keyword like Amazon, you get sentences like "I bought this toy from Amazon," which have nothing to do with financial information. Luckily, I found that I get plenty of finance-related data on Twitter if I search by stock symbols, such as $VZ or $AAPL. One problem, though, is that Twitter does not allow tracking 6,000 keywords at once, where each keyword would represent a stock symbol in my code. So I tuned my code to randomly pick 300 stock symbols from my database, listen to the Twitter stream until it collects 500 tweets that mention one of those 300 symbols, store the results in the database, and then pick another set of 300. The loop runs indefinitely. I had to keep monitoring the running code all the time because some batches of symbols can take a day to get anywhere near 500 tweets; when the catch is low (few fish in the pond), I force the code to pick another 300 stocks. So far, with the code running for close to two weeks, I have captured more than 56,000 tweets that relate to one of the 6,000 firms, and the code will continue running until God knows when.
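
As a sketch of that rotation logic, assuming the tweepy library as it existed around that time; the random_symbols list and the save_tweet() helper are hypothetical stand-ins for my actual database code:

```python
import random
import tweepy

BATCH_SIZE = 300        # number of cashtags tracked at a time
TWEETS_PER_BATCH = 500  # rotate to a new batch after this many matches

class CashtagListener(tweepy.StreamListener):
    """Counts matching tweets and stops the stream once a batch is full."""
    def __init__(self, save_tweet):
        super(CashtagListener, self).__init__()
        self.save_tweet = save_tweet
        self.count = 0

    def on_status(self, status):
        self.save_tweet(status)      # persist the tweet (e.g. into MySQL)
        self.count += 1
        if self.count >= TWEETS_PER_BATCH:
            return False             # returning False disconnects the stream

    def on_error(self, status_code):
        if status_code == 420:       # rate limited: disconnect and retry later
            return False

def run_forever(auth, all_symbols, save_tweet):
    while True:
        batch = random.sample(all_symbols, BATCH_SIZE)
        listener = CashtagListener(save_tweet)
        stream = tweepy.Stream(auth=auth, listener=listener)
        # track terms like "$VZ" so the stream returns finance-related tweets
        stream.filter(track=["$" + s for s in batch])
```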

4) Step 3 was only meant to collect the data from Twitter; what we need next is to determine the sentiment of each tweet. This requires a machine learning algorithm, because the whole goal is to automate the process rather than manually deciding whether each of the huge number of tweets is positive or negative. I researched several toolkits that could help with this task and ended up with two: the Python Natural Language Toolkit (NLTK) and the Stanford CoreNLP toolkit in Java. If I used NLTK, I would basically need to write all the code to parse each text and classify the tweets as positive or negative. Luckily, Stanford CoreNLP is open source and provides all of that functionality, so I decided to use it, which is what I had in mind in my dissertation proposal anyway. Stanford CoreNLP provides sentiment analysis based on a supervised machine learning algorithm that requires pretraining on a labeled data set. The algorithm works, but the shipped sentiment model is trained on movie reviews, which means that finance-related words like "buy," "sell," "loss," "call," or "put" will not be properly classified as positive or negative. Luckily, the algorithm allows retraining of the sentiment model, which means I need to come up with a good set of sentences that I manually classify as positive or negative before I can let the model handle the rest of the text. Based on my reading, I estimated that if I want the algorithm to properly classify around 100,000 tweets (now and in the future), I need to manually classify around 10,000 tweets first and feed them into the training. That is what I did for a whole week, as described in (5) below.
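
For context, the stock CoreNLP sentiment annotator can be run over a text file even before any retraining; a rough sketch of driving it from Python, where the file name and classpath are placeholders and the flags should be checked against the CoreNLP version in use:

```python
import subprocess

def run_corenlp_sentiment(input_file="tweets.txt",
                          corenlp_classpath="stanford-corenlp/*"):
    """Run the stock CoreNLP pipeline with the sentiment annotator on a text file.
    Paths are placeholders; flags follow the CoreNLP documentation."""
    subprocess.call([
        "java", "-mx4g", "-cp", corenlp_classpath,
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", "tokenize,ssplit,parse,sentiment",
        "-file", input_file,
    ])
```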

5) Manually classifying tweets is a laborious and time-consuming task. I thought about using Amazon Mechanical Turk, but there is controversy over the quality of work from unknown crowd workers, so I decided to handle the training myself with the help of family. I wanted the process to be very easy and the task accessible anywhere, so I created a mobile-friendly website that picks a random tweet from my database and asks the user to classify it as very negative, negative, neutral, positive, or very positive. It took one whole week to finish all 10,000. At one point, when we had 3,000 classified, I found a major bug in the code that was attaching sentiments to the wrong tweets, and that made me start all over again. In the end, all 10,000 tweets were manually classified. I also updated the database to flag those tweets as training tweets so that I do not reuse them in the future.
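
A minimal sketch of the kind of Django view behind that labeling page; the Tweet model and its fields are hypothetical placeholders, not the actual schema:

```python
# views.py -- sketch of the labeling page; Tweet model/fields are assumed
from django.shortcuts import redirect, render

from .models import Tweet  # assumed fields: text, sentiment, is_training

LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]

def label_tweet(request):
    if request.method == "POST":
        tweet = Tweet.objects.get(pk=request.POST["tweet_id"])
        tweet.sentiment = request.POST["label"]   # one of LABELS
        tweet.is_training = True                  # flag so it is not reused later
        tweet.save()
        return redirect("label_tweet")

    # pick a random tweet that has not been labeled yet
    tweet = Tweet.objects.filter(sentiment__isnull=True).order_by("?").first()
    return render(request, "label.html", {"tweet": tweet, "labels": LABELS})
```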

6) To retrain Stanford CoreNLP with new data, I need to convert the training data into Penn Treebank (PTB) format, such as (1 (1 hello) (1 how) (1 are .....
I first wrote a Python script that extracts the training data from the database, and then a Java program that uses one of the CoreNLP classes to turn the data into PTB format. After that, everything was ready to retrain the machine learning algorithm.
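
To show the shape of that format, here is a rough Python sketch that writes one flat labeled tree per tweet, using the 0-4 label scale (0 = very negative, 4 = very positive). The properly binarized trees came from the CoreNLP class mentioned above, so this is illustrative only, and the row layout is an assumption:

```python
def to_flat_ptb(label, text):
    """Turn (3, 'great earnings call') into '(3 (3 great) (3 earnings) (3 call))'.
    Real training trees are binarized by CoreNLP; this only shows the shape."""
    tokens = " ".join("(%d %s)" % (label, tok) for tok in text.split())
    return "(%d %s)" % (label, tokens)

def export_training(rows, path="train.txt"):
    """rows: iterable of (label, tweet_text) pairs pulled from the database."""
    with open(path, "w") as out:
        for label, text in rows:
            out.write(to_flat_ptb(label, text) + "\n")
```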

7) And here is where I am right now... I fed the PTB data into the CoreNLP training code, and the model is being trained as I write this.
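
The retraining itself is just a long-running Java job; a sketch of kicking it off from Python, with file names as placeholders and flags taken from the Stanford sentiment documentation:

```python
import subprocess

def retrain_sentiment_model(train_path="train.txt", dev_path="dev.txt",
                            model_path="tweet-sentiment-model.ser.gz",
                            corenlp_classpath="stanford-corenlp/*"):
    """Launch the CoreNLP sentiment training job.
    Paths are placeholders; flags follow the Stanford sentiment documentation."""
    subprocess.call([
        "java", "-mx8g", "-cp", corenlp_classpath,
        "edu.stanford.nlp.sentiment.SentimentTraining",
        "-numHid", "25",
        "-trainPath", train_path,
        "-devPath", dev_path,
        "-train",
        "-model", model_path,
    ])
```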

Major progress updates will follow later on.
