Complete Dissertation

posted Dec 3, 2016, 4:10 AM by Tarek Hoteit   [ updated Dec 3, 2016, 4:16 AM ]

Uploaded the final and approved dissertation. It is also available on

GRADUATED!!! Dissertation Published. Thanks, Dr Tarek Hoteit

posted Sep 8, 2015, 9:23 AM by Tarek Hoteit

major research progress updates

posted Dec 26, 2014, 11:40 AM by Tarek Hoteit   [ updated Dec 26, 2014, 11:50 AM ]

If you recall, the major goal of the research is to determine the relationship between sentiment on Twitter and corporate financial distress. The hypothesis questions are included in the dissertation proposal, but all of them require the data to be available, and that is what I have been working on over the last month. Here is a brief summary:

0) At first, I set up my own Linux server with a MySQL database, the Django web framework, a Git repository, IPython/Python, and everything else needed for the server. The dedicated server is running on Linode at a Dallas location.

1) To determine the corporate firms to be used in the analysis, I initially considered the S&P 500 companies, but I realized that such firms may be too big to plausibly be under distress. Hence, I decided to target all publicly held firms listed on the S&P 500, Nasdaq, and NYSE. I first contacted the SEC for the list of publicly held firms, but they gave me a huge list of individuals and corporations without confirming which firms are publicly held or not. In the end, I found the list on the NASDAQ website and exported it in CSV format. I have approximately 6,000 firms.

2) To determine the financials for each firm, I found that Yahoo has a repository that supports JSON output and database-like querying, called YQL, which can be used to retrieve the financials of each firm (balance sheet, income statement, and cash flow) through hosted apps that interface with SEC websites. I wrote a Python script that picks each of the 6,000 firms, extracts the financials for the fourth quarter of 2014, and imports them into the database. I chose the financial variables according to the Altman Z-Score for financial distress analysis, but I can easily modify the script to support other financial ratios. So now I have fourth-quarter financials for all publicly held firms, which I will use later.
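Since the variables are chosen for the Altman Z-Score anyway, the distress score itself reduces to a few ratios. A minimal sketch of the classic Z-Score for publicly held manufacturing firms (the parameter names here are illustrative, not the actual column names in my database):

```python
def altman_z_score(working_capital, retained_earnings, ebit,
                   market_value_equity, sales,
                   total_assets, total_liabilities):
    """Classic Altman Z-Score for publicly held manufacturing firms.

    Rough interpretation: Z < 1.81 suggests distress,
    Z > 2.99 suggests the safe zone.
    """
    x1 = working_capital / total_assets
    x2 = retained_earnings / total_assets
    x3 = ebit / total_assets
    x4 = market_value_equity / total_liabilities
    x5 = sales / total_assets
    return 1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5
```

For example, a firm with $20M working capital, $30M retained earnings, $10M EBIT, $60M market equity, $150M sales, $100M total assets, and $50M total liabilities scores 3.21, comfortably above the distress cutoff.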

3) To extract sentiments from Twitter, I first needed a program that can interface with Twitter and extract the data I need. At first, my thought was to extract all data since 2008 for each of the firms that reported bankruptcy, as listed in the Bankruptcy Research Database. The problem, however, is that Twitter only allows access to roughly the last week of its archive. In addition, it enforces a limit on how much one can extract unless you pay tens of thousands of dollars to the vendors contracted to sell access to the Twitter firehose. Not wanting to do that, I decided to write code that runs endlessly and monitors the live Twitter stream for data about the corporate firms I want.

The algorithm worked at first when I searched for companies like Amazon or Google, but I could not get results when I searched for companies like EDS. It turns out that the Twitter search API does not allow exhaustive or partial-word search. Moreover, when you search for keywords like Amazon, you get sentences like "I bought this toy from Amazon," which have nothing to do with financial information. Luckily, I found that I get a lot of finance-related data on Twitter if I search by stock symbols, such as $VZ or $AAPL.

One problem, though, is that Twitter does not allow a search on 6,000 keywords, where each keyword would represent a stock symbol in my code. What I did is tune my code to randomly pick 300 stock symbols from my database, listen to the Twitter stream until it captures 500 tweets that include one of those 300 symbols, store the results in the database, and then pick another set of 300. The logic runs indefinitely. I had to keep monitoring the running algorithm all the time because certain stock symbols may take a day to get anywhere close to 500 messages. When the catch is low (few fish in the pond), I force the code to pick another 300 stocks. So far, with the code running for close to two weeks, I have captured more than 56,000 tweets relating to the 6,000 firms, and the code will continue running until God knows when.
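Stripped of the Twitter plumbing, the rotation logic in step 3 can be sketched roughly like this. The function and constant names are made up for illustration, and the stream here is any iterable of tweet texts; in the real code it is a live Twitter streaming connection:

```python
import random

BATCH_SIZE = 300        # stock symbols tracked per streaming session
TWEETS_PER_BATCH = 500  # tweets to capture before rotating symbols

def harvest(symbols, stream, store):
    """Rotate through random batches of stock symbols, storing tweets
    that mention a cashtag (e.g. "$VZ") from the current batch.

    symbols: list of stock symbols from the database
    stream:  iterable of tweet texts (a live Twitter stream in production)
    store:   callback invoked with each matching tweet
    """
    batch = random.sample(symbols, min(BATCH_SIZE, len(symbols)))
    captured = 0
    for tweet in stream:
        if any(f"${sym}" in tweet for sym in batch):
            store(tweet)
            captured += 1
        if captured >= TWEETS_PER_BATCH:
            # quota met: rotate to a fresh random batch of symbols
            batch = random.sample(symbols, min(BATCH_SIZE, len(symbols)))
            captured = 0
```

In production the "low catch" case would also trigger the rotation, e.g. via a timeout on the stream rather than only the tweet-count quota.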

4) Step 3 was only meant to collect the data from Twitter; what we need is to determine the sentiment in each tweet. This requires a machine learning algorithm, because the whole goal is to automate the process rather than manually judge whether the mass of tweets carries positive or negative sentiment. I researched several toolkits that could help with this task and ended up with two: the Python Natural Language Toolkit (NLTK) and the Stanford CoreNLP toolkit in Java. If I used NLTK, I would basically need to write all the code to parse each text and classify the tweets as positive or negative. Luckily, Stanford CoreNLP is open source and provides all of this functionality, so I decided to use it, which I previously had in mind during my dissertation proposal. Stanford CoreNLP provides sentiment analysis based on a supervised machine learning algorithm that requires pretraining on a labeled dataset. The algorithm works, but its sentiment classifications were trained on movie reviews, which means that finance-related words like "buy," "sell," "loss," "call," or "put" will not be properly classified as positive or negative. Luckily, the algorithm allows retraining of the sentiment model, which means I need to come up with a good set of sentences that I manually classify as positive or negative before letting the model handle the rest of the text. Based on my research, I estimated that for the algorithm to properly classify around 100,000 tweets (now and in the future), I need to manually classify around 10,000 tweets first and feed them into the training. That is what I did for a whole week, under (5) below.
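For contrast, here is roughly the kind of thing the NLTK route would have required building by hand: a tiny bag-of-words Naive Bayes classifier with add-one smoothing. This is a toy sketch on made-up data, not the CoreNLP model I am actually using:

```python
import math
from collections import Counter

def train(labeled):
    """labeled: list of (text, label) pairs, label in {"pos", "neg"}."""
    counts = {"pos": Counter(), "neg": Counter()}  # word counts per class
    docs = Counter()                               # document counts per class
    for text, label in labeled:
        docs[label] += 1
        counts[label].update(text.lower().split())
    return counts, docs

def classify(text, counts, docs):
    """Naive Bayes with add-one (Laplace) smoothing over the vocabulary."""
    vocab = set(counts["pos"]) | set(counts["neg"])
    scores = {}
    for label in ("pos", "neg"):
        total = sum(counts[label].values())
        # log prior + sum of smoothed log likelihoods
        score = math.log(docs[label] / sum(docs.values()))
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)
```

A real NLTK pipeline would add tokenization, feature selection, and a much larger labeled corpus, which is exactly the work CoreNLP saves me.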

5) Manually classifying tweets is a laborious and time-consuming task. I thought about using Amazon Mechanical Turk, but there is controversy over the quality of work from unknown workers. So I decided to handle the training myself with the help of family. I wanted to make the process very easy and the task accessible anywhere, so I created a mobile-friendly website that picks a random tweet from my database and asks the user to classify the text as very negative, negative, neutral, positive, or very positive. It took one whole week to finish all 10,000. At one point, when we had 3,000 classified, I found a major bug in the code that was attaching sentiment labels to the wrong tweets, and that made me start all over again. In the end, all 10,000 tweets were manually classified. I also updated the database to flag those tweets as training data so that I do not reuse them in the future.

6) To retrain Stanford CoreNLP with new data, I need to convert the training data into Penn Treebank format, such as (1 (1 hello) (1 how) (1 are .....
I first wrote a Python script that extracts the training data from the database, and then a Java program that uses one of the CoreNLP classes to turn the data into PTB format. After that, it was ready to retrain the machine learning algorithm.
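For illustration, here is a rough idea of what such a conversion produces. This is a hypothetical Python stand-in for the CoreNLP class I actually used: it emits a right-branching binarized tree, with the sentence-level sentiment label (0 = very negative .. 4 = very positive) at the root and neutral labels (2) on nodes that were not hand-labeled:

```python
def to_ptb(tokens, sentiment):
    """Render a token list as a binarized PTB-style sentiment tree.

    tokens:    list of words from one training sentence
    sentiment: root label on the 0..4 scale from the manual classification
    """
    def branch(toks):
        # inner nodes default to neutral (2); binarize right-branching
        if len(toks) == 1:
            return f"(2 {toks[0]})"
        return f"(2 {branch(toks[:1])} {branch(toks[1:])})"
    if len(tokens) == 1:
        return f"({sentiment} {tokens[0]})"
    return f"({sentiment} {branch(tokens[:1])} {branch(tokens[1:])})"
```

So a positively labeled tweet "stock is up" would come out as `(3 (2 stock) (2 (2 is) (2 up)))`, one tree per line in the training file.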

7) And here is where I am right now... I fed the PTB data into the CoreNLP training code, and the model is being trained right now...

Major progress updates will follow later on.

code part A completed

posted Dec 6, 2014, 4:47 AM by Tarek Hoteit

I finally got the first module completed. The Python code runs indefinitely, searching for Fortune 500 companies on Twitter and storing the tweets in the database.

database needs to be updated to support tweets with emoji

posted Nov 28, 2014, 3:28 PM by Tarek Hoteit

-- Check the current character set and collation settings
SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';

-- Switch the database default to utf8mb4 so 4-byte characters (emoji) fit
ALTER DATABASE finsentimentdb CHARACTER SET = utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Convert the existing tweet tables and the text column to utf8mb4
ALTER TABLE finsentimentdb.twitterSentiment CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE twitterSentiment_twittertext CHANGE twitter_text twitter_text varchar(1024) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL;

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'example',
        'USER': 'example',
        'PASSWORD': 'example',
        'HOST': '',
        'PORT': '',
        'OPTIONS': {'charset': 'utf8mb4'},
    }
}
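As a sanity check on why the conversion is needed: MySQL's legacy utf8 charset stores at most 3 bytes per character, while emoji sit outside Unicode's Basic Multilingual Plane and take 4 bytes in UTF-8. A quick check in Python (the tweet text is made up):

```python
# Emoji are outside the Basic Multilingual Plane, so their UTF-8 encoding
# is 4 bytes; MySQL's legacy "utf8" charset caps at 3 bytes per character,
# which is why inserting such tweets fails until the tables use utf8mb4.
tweet = "earnings beat expectations \U0001F680"  # rocket emoji
emoji = tweet[-1]
print(len(emoji.encode("utf-8")))  # 4
```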


finSentiment code plan - initial progress

posted Nov 10, 2014, 4:48 AM by Tarek Hoteit

Here is the plan:
0) Create code repository. Done. I am using my personal hosted GitLab on (use GitLab/PyCharm)
1) Set up the basic database and framework. Done. I picked SQLite and Django. Note that in the proposal I mentioned using MySQL. We might change either the code or the draft after we determine the performance of each database (use Django shell script)
2) Create the DB tables. Done. Basic tables are created; more to be added, of course (run Django script)
3) Create Django views. In progress. Do the admin views first because they are easier to implement (run Django script)
4) Add the list of firms to the database (create Python script) using the Excel sheet from the Bankruptcy Research Database (BRD)

After completing 4, we should have an initial backbone for the system

5) Develop Twitter code to extract textual data from the site (create Python script)
6) Execute the code and import the results into the database (create Python script)

After completing 6, we should have a database loaded with textual data. Next comes the sentiment classification step:

7) Set up a Java environment for the Stanford sentiment classification algorithm (Linux admin)
8) Execute the analyzer for each of the Twitter texts in our database (write a bash or Perl script to do that)
9) Import the results into our database (bash, Perl, or SQLite/MySQL script)

After completing 9, we should have our database loaded with the sentiment classification for each Twitter text about the firms in the study. Next is completing the statistical data analysis.

10) Run a logistic regression to determine whether the sentiment is statistically significant, etc. (more to come on that later)
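To make step 10 concrete, here is a toy sketch of the kind of model I have in mind: distress (1/0) regressed on a firm's average tweet sentiment. The data below is made up purely for illustration, and in the real analysis I would use a statistics package rather than hand-rolled gradient descent:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit distress ~ sentiment by batch gradient descent on the
    logistic loss. xs: average sentiment score per firm; ys: 1 if the
    firm is distressed, else 0. Returns the slope w and intercept b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # prediction error
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# hypothetical data: lower average sentiment goes with distress
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [1, 1, 1, 1, 0, 0, 0, 0]
w, b = fit_logistic(xs, ys)
```

A significantly negative slope w would support the hypothesis that negative Twitter sentiment is associated with distress; the real analysis would also report p-values and control variables, which a package like statsmodels provides directly.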

Resuming Dissertation

posted Nov 10, 2014, 3:49 AM by Tarek Hoteit   [ updated Nov 10, 2014, 4:49 AM ]

I am back from a leave of absence. The committee has changed since Dr Hamzaee retired: Dr Bouvin is now my dissertation chair, and Dr Gould is my committee member.
Last month Dr Gould provided new feedback on my proposal and suggested that I update it using the new dissertation template. Since then, I have made the updates, but I have not yet sent the revised draft to Dr Gould.

[[[ list the updates here ]]]

updating the study to quant

posted Jun 22, 2014, 11:59 AM by Tarek Hoteit   [ updated Jun 24, 2014, 5:53 AM ]

Still a work in progress. I am focusing on pages 74-78.

quantitative checklist and proposal draft modifications

posted Jun 19, 2014, 7:52 AM by Tarek Hoteit

I am modifying the study to make it quantitative instead of mixed-method, and I am also updating the checklist. From the checklist, I am now at Chapter 1: Definitions.
I made changes to the draft, but I still need to correct the issues of trustworthiness and the role of the researcher.
