A Naive Bayes classifier for Shakespeare's second-person pronoun

K. Mahowald
2011 Literary and Linguistic Computing  
In order to investigate in explicit detail the way that y- and th-pronouns alternate in the Shakespearean corpus, I have undertaken a collocational analysis of the full corpus of Shakespeare's 37 plays and found that (1) second-person pronouns can be disambiguated based on context alone, (2) y-pronouns seem to be used in more formal situations or when an inferior is addressing a social better, and (3) the th-pronoun is reserved for addressing peers, servants, or other familiar personages.
... the Python Natural Language Toolkit (Bird et al., 2009, Natural Language Processing with Python. Sebastopol, CA: O'Reilly Media), I implemented a Naïve Bayes classifier that in effect treats each occurrence of a second-person pronoun as a black box that must be resolved into either a y-pronoun or a th-pronoun based only on the surrounding words. Using tenfold cross-validation, the classifier achieves an accuracy of 78.3% when fellow th- and y-pronouns are excluded from the context and 88.0% when we allow fellow th- and y-pronouns to assist in classification. Most interesting, however, are the context words that prove most informative in categorizing the pronouns. Significantly, the words most useful in classifying a pronoun as a y-pronoun include high-register words such as lordship, madam, lords, and sir. After a group of conjugated second-person verbs like art and wert, the words most associated with th-pronouns are words such as torment, nuncle, lesser, and villain. The ability to discriminate between forms based only on context confirms the hypothesis that the two classes of second-person pronoun are indeed used distinctly in the Shakespearean corpus. The list of words most helpful in making that distinction strongly suggests a difference in formality. We can also gain additional insight into the plays by examining some of the unexpected words that collocate with either one form or the other.
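The procedure the abstract describes (mask each second-person pronoun, classify it from the surrounding words, estimate accuracy by cross-validation, then rank the most informative context words) can be illustrated with a minimal sketch. This is not the author's code: it is a from-scratch Naive Bayes with add-one smoothing written in plain Python rather than with the Natural Language Toolkit, and the pronoun lists, the three-word context window, the function names, and the toy lines are all assumptions made for illustration. The toy lines are invented and are not quotations from the plays.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed pronoun inventories; the paper's exact lists are not given in the abstract.
Y_FORMS = {"you", "your", "yours", "ye", "yourself", "yourselves"}
TH_FORMS = {"thou", "thee", "thy", "thine", "thyself"}
PRONOUNS = Y_FORMS | TH_FORMS


def context_examples(text, window=3, mask_pronouns=True):
    """Return (context_word_set, label) for each second-person pronoun in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    examples = []
    for i, tok in enumerate(tokens):
        if tok not in PRONOUNS:
            continue
        label = "y" if tok in Y_FORMS else "th"
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if mask_pronouns:
            # Exclude fellow y- and th-pronouns from the context.
            neighbours = [w for w in neighbours if w not in PRONOUNS]
        examples.append((set(neighbours), label))
    return examples


def train(examples):
    """Count labels and per-label context words."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for words, label in examples:
        word_counts[label].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab


def classify(model, words):
    """Pick the label with the highest add-one-smoothed log posterior."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(examples, folds=10):
    """k-fold cross-validation accuracy, using every folds-th example as test data."""
    correct = 0
    for k in range(folds):
        test = examples[k::folds]
        training = [ex for i, ex in enumerate(examples) if i % folds != k]
        model = train(training)
        correct += sum(classify(model, words) == label for words, label in test)
    return correct / len(examples)


def informative_words(model, top=5):
    """Rank words by smoothed likelihood ratio P(w | y) / P(w | th), highest first."""
    _, word_counts, vocab = model

    def prob(word, label):
        denom = sum(word_counts[label].values()) + len(vocab)
        return (word_counts[label][word] + 1) / denom

    return sorted(vocab, key=lambda w: prob(w, "y") / prob(w, "th"), reverse=True)[:top]


# Invented toy lines, for illustration only.
TOY_LINES = [
    "my lord I thank you sir",
    "madam I beseech you hear me",
    "good sir your lordship is welcome",
    "thou art a villain",
    "villain I defy thee",
    "nuncle thou art a fool",
]

examples = [ex for line in TOY_LINES for ex in context_examples(line)]
model = train(examples)
print(classify(model, {"sir", "lordship"}))
print(classify(model, {"villain"}))
print(informative_words(model, top=3))
print(cross_validate(examples, folds=3))
```

Setting `mask_pronouns=False` corresponds to the second condition in the abstract, in which neighbouring th- and y-pronouns are allowed to assist in classification. On six toy lines the cross-validation figure is meaningless; it is included only to show the shape of the evaluation.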
doi:10.1093/llc/fqr045 fatcat:5zzpcha6tjbpzinmel3myfrhkq