Submit the .py file(s) and a document (Word, readme, etc.) that summarizes what you did.
Imagine that you have just been hired as an intern by an NLP startup. They have a sentiment analysis project which they started but didn’t finish because it was slow and was not getting good results. They hand you the project and ask you to get better results as the first priority and a faster execution time as a secondary priority. For reasons they do not explain to you, they are emphatic that this project must not use machine learning, but only standard NLP techniques.
Here is what they tell you about the project:
- eventually they want to run it on larger amounts of data but for now they have been working on a prototype with a small set of data
- the data consists of two folders, pos and neg, which each contain 100 movie reviews
- for each folder (pos and neg):
  - for each file:
    - read the text from the file
    - remove punctuation, digits, newlines, and stopwords
    - send the tokens to the assess_polarity1() function
    - keep a running total of the number of files classified correctly
  - print the overall accuracy for that folder (pos or neg)
- the assess_polarity1() function:
  - for each token:
    - find the senti synset for that word
    - assume that the first synset is the one we want
    - add the pos and neg scores to running totals
  - divide the running totals by the number of tokens to get a negative score and a positive score
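The steps above can be sketched roughly as follows. This is not the startup's driver.py, only an illustration of the described flow. The stopword list and the toy lexicon are hypothetical stand-ins (the real project would presumably look up each word's first senti synset instead), and the lowercasing step, the function names, and the rule that a tie counts as negative are all assumptions.

```python
import string

# Hypothetical stand-ins for a real stopword list and a sentiment lexicon.
STOPWORDS = {"the", "a", "an", "is", "was", "it", "and", "of", "this"}
TOY_LEXICON = {  # word -> (positive score, negative score)
    "great": (0.75, 0.0),
    "good": (0.625, 0.0),
    "boring": (0.0, 0.5),
    "awful": (0.0, 0.875),
}

# Translation table that deletes punctuation and digits.
_STRIP = str.maketrans("", "", string.punctuation + string.digits)


def preprocess(text):
    """Drop punctuation and digits, split on whitespace (which also
    removes newlines), and filter out stopwords.
    Lowercasing is an assumption so that stopwords match."""
    cleaned = text.lower().translate(_STRIP)
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


def assess_polarity(tokens, lookup):
    """Return (positive, negative) running totals divided by token count.

    `lookup(word)` is assumed to return (pos, neg) for the word's first
    sense, or None when the word is not in the lexicon.
    """
    pos_total = neg_total = 0.0
    for tok in tokens:
        scores = lookup(tok)
        if scores is not None:
            pos_total += scores[0]
            neg_total += scores[1]
    if not tokens:
        return 0.0, 0.0
    return pos_total / len(tokens), neg_total / len(tokens)


def classify(text, lookup):
    """Label a review 'pos' or 'neg'; a tie counts as 'neg' (assumption)."""
    pos, neg = assess_polarity(preprocess(text), lookup)
    return "pos" if pos > neg else "neg"


def folder_accuracy(texts, expected, lookup):
    """Fraction of reviews in one folder that were classified as expected."""
    correct = sum(classify(t, lookup) == expected for t in texts)
    return correct / len(texts)
```

Passing the lookup in as a function keeps the averaging logic separate from the lexicon, so a different scoring source can be tried without touching the rest of the pipeline.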
Here is what they have done:
Your task is to make this program better. Accuracy is the priority; run time matters too, but it comes second.
There are a couple of things you should not change in the program:
- The use of pathlib. You should create a new project and copy in the movie reviews folder and the driver.py file. This code was tested on a MacBook and a Windows 10 computer and was able to read the files on both systems by using pathlib. Using pathlib will make grading easier for the TA.
- The use of the timer. There are other ways to time functions, such as timeit(), but everyone should use the same time function as shown in the code. You should first run the program as is on your system and record how long it takes on your computer. Also record the accuracy for the pos and the neg folders.
Other than that, you are free to experiment with making a better sentiment analysis system.
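For reference, here is a minimal sketch of why pathlib reads the files the same way on both systems. It is not the code from driver.py. The helper name `read_reviews`, the `.txt` extension, and the UTF-8 encoding are assumptions, so adjust them to match the actual data.

```python
from pathlib import Path


def read_reviews(root, label):
    """Yield (file name, text) for every .txt file under root/label.

    The / operator joins path parts with the right separator for the
    current operating system, so no path string needs editing.
    """
    folder = Path(root) / label
    for path in sorted(folder.glob("*.txt")):
        yield path.name, path.read_text(encoding="utf-8")
```

A call such as `read_reviews("movie_reviews", "pos")` would then iterate over the positive reviews, where `"movie_reviews"` is a placeholder for whatever the folder is called in your project.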
- Word, readme.txt, etc., anything that is cross-platform
- First list the baseline metrics, and some analysis, for your system on the code as downloaded:
  - accuracy for the negative folder
  - accuracy for the positive folder
  - run time on your system
  - analyze the weaknesses of this approach
- Then, for every new approach you try **even if it didn’t work well**, document what you did and the new accuracies and run times
- For each approach, provide some analysis on why it worked better or worse.
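One possible way to keep these numbers consistent across approaches is a small helper that renders them as a plain-text table for the write-up. This is only a sketch: the function name `format_results` and the column layout are assumptions, and the rows must come from your own runs.

```python
def format_results(rows):
    """Render (approach, neg_acc, pos_acc, seconds) rows as a text table.

    Accuracies are fractions between 0 and 1; seconds is the run time
    reported by the timer in the driver.
    """
    header = f"{'approach':<28}{'neg acc':>9}{'pos acc':>9}{'time (s)':>10}"
    lines = [header, "-" * len(header)]
    for name, neg_acc, pos_acc, seconds in rows:
        lines.append(f"{name:<28}{neg_acc:>9.2%}{pos_acc:>9.2%}{seconds:>10.2f}")
    return "\n".join(lines)
```

Appending one row per experiment, including the ones that did not help, gives a record that can be pasted directly into the document.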