View the Overview video for more information on this learning sequence, including a short lecture on Natural Language Processing. (Check the Resources section for links to more information on advanced concepts such as Part-of-Speech Tagging and Lexicon Normalisation.)
View the video Setup. This will introduce the repl.it coding environment and the process for obtaining the text of Alice in Wonderland from the online repository Project Gutenberg.
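If you'd like to fetch the text in code rather than downloading it by hand, a minimal sketch using Python's standard library follows. The URL is an assumption based on Project Gutenberg's plain-text edition of the book (ebook #11); check gutenberg.org for the current link.

import urllib.request

# Assumed link to Project Gutenberg's plain-text edition (ebook #11);
# confirm the current URL on gutenberg.org before relying on it.
URL = "https://www.gutenberg.org/files/11/11-0.txt"

with urllib.request.urlopen(URL) as response:
    text = response.read().decode("utf-8")

# Save a local copy so later parts of the sequence can work offline.
with open("alice.txt", "w", encoding="utf-8") as f:
    f.write(text)

print(text[:200])  # sanity check: print the opening of the file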
View the video Removing punctuation. In this part, we write and test a function to remove all punctuation from the text, so that our text analysis can focus on words only.
(Completed code up to this point.)
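One possible shape for such a function is sketched below; it is not the video's exact code, just an illustration using Python's built-in string module.

import string

def remove_punctuation(text):
    # Build a translation table that deletes every character found in
    # string.punctuation, then apply it to the text.
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("Alice was beginning to get very tired, wasn't she?"))
# -> Alice was beginning to get very tired wasnt she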
Functions will be used throughout this learning sequence, including functions with parameters and return values.
View the video Intro to Functions in Python for a brief introduction to writing and using functions.
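As a warm-up (not from the video itself), here is a tiny function with one parameter and a return value:

def count_letters(word):
    # Return the number of letters in word, ignoring spaces.
    return len(word.replace(" ", ""))

result = count_letters("White Rabbit")  # the argument is "White Rabbit"
print(result)                           # the returned value is 11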
View the video Tokenisation 1, noting the minor changes made manually to the text of Alice in Wonderland before the coding begins.
In this part, we write new functions to break up the book text into a list of words, or into a list of sentences.
(Completed code up to this point.)
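In outline, the two functions might look something like the sketch below. This is a simplified version: splitting sentences on full stops is naive, and the video's functions may handle more edge cases.

def get_words(text):
    # Split on any whitespace to produce a list of words.
    return text.split()

def get_sentences(text):
    # Naively treat every full stop as the end of a sentence.
    return [s.strip() for s in text.split(".") if s.strip()]

sample = "Alice was bored. She saw a White Rabbit. Down she went."
print(get_words(sample))
print(get_sentences(sample))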
Now view the video Tokenisation 2. In this video, we write two more functions to break up the book text into a list of paragraphs, or into a list of chapters. This completes our library of functions for tokenising the book's text.
(Completed code up to this point.)
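A rough sketch of the idea (the video's code may differ): paragraphs are assumed to be separated by blank lines, and each chapter is assumed to begin with the heading word CHAPTER, as in the Gutenberg edition.

def get_paragraphs(text):
    # Blank lines (two newlines in a row) separate paragraphs.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def get_chapters(text):
    # Everything before the first CHAPTER heading is front matter,
    # so it is dropped; each remaining piece is one chapter.
    parts = text.split("CHAPTER")
    return ["CHAPTER" + p for p in parts[1:]]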
Working with large bodies of text usually requires a bit of manual editing.
View the video Text File Preparation for more details on obtaining and editing the most suitable book text of Alice in Wonderland.
View the video Modular programming. In this part, the functions we've created are moved into a separate file.
(Completed code up to this point.)
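For example, if the tokenising functions were moved into a file called tokenise.py (the file name is an assumption; the video may choose another), the main program could import them like this:

# main.py
from tokenise import get_words, get_sentences, get_paragraphs, get_chapters

with open("alice.txt", encoding="utf-8") as f:
    book = f.read()

print(len(get_words(book)), "words")
print(len(get_sentences(book)), "sentences")
print(len(get_chapters(book)), "chapters")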
View the video Lecture – Sentiment Analysis for an introduction to sentiment analysis, including the concepts of polarity and subjectivity.
Lectures are primarily intended for a teacher audience, but you may choose to view the video again together with students.
View the video Testing Sentiment Analysis. In this part, we use the TextBlob module to rate the polarity and subjectivity of sentences, paragraphs and chapters. (For an explanation of these two concepts, be sure to view the lecture in the previous section.)
(Completed code up to this point.)
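A minimal sketch of scoring one sentence with TextBlob is shown below. Polarity runs from -1.0 (negative) to 1.0 (positive), and subjectivity from 0.0 (objective) to 1.0 (subjective).

from textblob import TextBlob

sentence = "The Queen was dreadfully angry and shouted at poor Alice."
blob = TextBlob(sentence)

print(blob.sentiment.polarity)      # likely negative for this sentence
print(blob.sentiment.subjectivity)  # likely high, as the wording is opinionated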
View the video Frequency of Words 1. In this part, we write and test a function to store each unique word in the book alongside the number of times it appears. In Part 7, we'll rank these to find the most frequent words in the book.
(Completed code up to this point.)
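The core idea can be sketched in a few lines (the video's function may differ in detail):

def word_frequencies(words):
    # Map each unique word to the number of times it appears.
    freq = {}
    for word in words:
        freq[word] = freq.get(word, 0) + 1
    return freq

print(word_frequencies(["down", "the", "rabbit", "hole", "the", "rabbit"]))
# -> {'down': 1, 'the': 2, 'rabbit': 2, 'hole': 1}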
A dictionary data structure is used to hold the data for the most frequent words. This data structure is sometimes called an associative array or a map in other programming languages.
View the video Intro to Dictionaries for a brief introduction to the dictionary data structure.
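If dictionaries are new to you, these few toy lines (not from the video) show the basic operations:

ages = {"Alice": 7, "Mad Hatter": 40}   # keys map to values
ages["Queen of Hearts"] = 50            # add a new key/value pair
print(ages["Alice"])                    # look up a value by key -> 7
print("Dormouse" in ages)               # test whether a key exists -> False
print(ages.get("Dormouse", 0))          # supply a default when absent -> 0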
These short exercises will help you practise using Python Dictionaries:
Now view the video Frequency of Words 2. In this video, we make some improvements to our function. Now, the dictionary will be constructed ignoring the case of the words from the book. So, “Rabbit” and “rabbit” will now be considered one unique word, and the frequency value will reflect all instances of both words.
But, if the function’s new second argument cap is set to True, something quite different will happen. The dictionary will be constructed with only the Title Case words from the book, such as proper nouns. So, “rabbit” will not be included at all, but words like “Rabbit” and “Alice” will be included.
During this video, the presenter makes use of a Python shortcut called a List Comprehension, which allows a list to be quickly made from another list without many lines of code. This is not an essential skill and is merely done for convenience. See this external tutorial for more information on List Comprehensions.
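As a rough sketch of how the two behaviours and the list comprehension might fit together (the parameter name cap comes from the video; the implementation here is only illustrative):

def word_frequencies(words, cap=False):
    if cap:
        # A list comprehension: build a new list containing only the
        # Title Case words, in a single line.
        words = [w for w in words if w.istitle()]
    else:
        # Ignore case so "Rabbit" and "rabbit" count as one word.
        words = [w.lower() for w in words]
    freq = {}
    for word in words:
        freq[word] = freq.get(word, 0) + 1
    return freq

sample = ["Rabbit", "rabbit", "Alice", "saw", "the", "Rabbit"]
print(word_frequencies(sample))            # -> {'rabbit': 3, 'alice': 1, 'saw': 1, 'the': 1}
print(word_frequencies(sample, cap=True))  # -> {'Rabbit': 2, 'Alice': 1}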
View the video Ranking Words by Frequency. In this short part, we make use of a dictionary created with the function we wrote in Part 6. The dictionary contains unique words from the book alongside how often they appear in the book. Now we will sort those entries. The result is a simple list of the words ordered by frequency.
(Completed code up to this point.)
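One way to do the sort is sketched below; the counts are made up for illustration.

def rank_by_frequency(freq):
    # sorted() with key=freq.get orders the words by their counts;
    # reverse=True puts the most frequent word first.
    return sorted(freq, key=freq.get, reverse=True)

freq = {"queen": 68, "alice": 386, "rabbit": 43}  # made-up counts
print(rank_by_frequency(freq))  # -> ['alice', 'queen', 'rabbit']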
View the video Removing Stop Words. In this part, we write a function to filter out stop words: very common English words like “the”, “I” and “and”. By first removing these from the list of words in the book, our ranked list of frequent words will be more useful.
(Completed code up to this point.)
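The filtering step itself can be sketched like this; the stop word set here is a tiny sample, where the video uses a far longer list:

STOP_WORDS = {"the", "i", "and", "a", "to", "of"}  # tiny sample list

def remove_stop_words(words, stop_words):
    # Keep only the words that are not in the stop word set.
    return [w for w in words if w.lower() not in stop_words]

print(remove_stop_words(["Down", "the", "Rabbit", "Hole"], STOP_WORDS))
# -> ['Down', 'Rabbit', 'Hole']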
View the video Heroes and Villains 1. By now, we have identified the likely main characters in the book. In this part, we use sentiment analysis to guess whether each character is a hero or villain.
(Completed code up to this point.)
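The broad idea can be sketched as follows (the video's function will differ in its details): average the polarity of every sentence that mentions a character, and treat a clearly negative average as a hint of villainy.

from textblob import TextBlob

def character_polarity(sentences, name):
    # Average the polarity of every sentence mentioning the character.
    scores = [TextBlob(s).sentiment.polarity
              for s in sentences if name.lower() in s.lower()]
    return sum(scores) / len(scores) if scores else 0.0

sample = ["Alice kindly helped the little mouse.",
          "The Queen was furious and screamed horribly."]
print(character_polarity(sample, "Alice"))  # likely positive
print(character_polarity(sample, "Queen"))  # likely negative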
Now view the final video Heroes and Villains 2. In this video, we try a couple of other books, and we tweak the behaviour of the function for judging the main characters.
Initially, students may wish to try reusing the program already written in this learning sequence.
Sometimes it can be hard to tell if articles from newspapers and other online sources are reporting pieces or opinion pieces.
Write a fresh program that uses sentiment analysis to determine if an article is a reporting piece or an opinion piece, based on the subjectivity of its language. We might expect opinion pieces to have higher subjectivity.
You will need to:
Note: you may wish to reuse modules or functions written in this learning sequence, but your main program must be freshly written, with appropriate comments.
Design and implement a research project to test a limit of the sentiment analysis approach used in this learning sequence, e.g. how well does it respond to different conversational styles?
Write a new tool for analysing dialogue in a movie or play script.
The tool should be able to separate each paragraph, ignore stage/screen directions and notes, and connect each piece of dialogue with a character from a limited cast.
From there, characters can be compared based on volume of dialogue and sentiment analysis of dialogue.