Course Syllabus

ISMT E-117 (#16099)
Text Analytics and Natural Language Processing

Instructor: Richard Joltes, ALM in IT, Harvard University Division of Continuing Education

Meetings: Online via Zoom, starting Monday 8/31/2020 5:30 – 7:30PM

Introduction

The extraction of relevant information from a mass of raw, unstructured text can provide a cornucopia of useful insights, which can then be used to drive business decisions in a variety of contexts. If, for example, a manufacturer performs analytics on Voice of Customer (VOC) texts extracted from vendor sites or its own customer service records, it may identify a product defect or service issue quickly enough to react before it becomes a newsworthy (or worse, litigation-producing) event.

From another perspective, if patient information systems aggregate records and perform textual analysis on physician notes or diagnoses, an impending disease outbreak may be identified far enough in advance to improve the medical community’s response. The opportunities for leveraging textual information are endless, yet many organizations are ill-prepared and ill-equipped to handle both the volume and the variety of such data.

Analysis of free-form text is messy and difficult, since language is fluid and usage often varies from one region to another. However, many tools, both free and commercial, are available from a variety of sources. Some are complete, out-of-the-box solutions, while others are toolkits requiring some level of programming experience to implement in a given context.

This course introduces students to the tools, techniques, and opportunities for performing text analytics in a variety of contexts. We examine tools such as NLTK, spaCy, and the WordNet dictionary along with fully featured applications such as IBM’s Watson Explorer analytics platform. The new BERT model will be explored if time allows. Significant discussion will also be devoted to organizational aspects, such as governance, data integrity, and the process of identifying/processing a body of texts in order to address a business need.

Course work involves using the selected tools to analyze groups of texts for insights such as:

  • Sentiment – how a consumer or client feels about a product or experience
  • Metadata – can we reliably identify phone numbers, credit card numbers, model numbers, or other specific elements?
  • Named Entity Recognition – is it possible to reliably extract personal names, locations, and other specific entities from raw text?
  • Parts of Speech (POS) – identifying the usage of words in context, in order to perform topic analysis or perform other Natural Language Processing (NLP) functions against a body of texts
  • Word Senses – determining how a word is used in a specific context
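As a small preview of the metadata task above, the sketch below uses Python’s standard re module to flag candidate phone and credit-card numbers. The patterns and sample text are invented for illustration; real extraction pipelines need far broader patterns plus validation (e.g. Luhn checksums for card numbers).

```python
import re

# Deliberately naive patterns for two common metadata types.
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d{4}[- ]){3}\d{4}\b")

def extract_metadata(text):
    """Return candidate phone and credit-card numbers found in text."""
    return {
        "phones": PHONE_RE.findall(text),
        "cards": CARD_RE.findall(text),
    }

sample = "Call 617-555-0123 about card 4111 1111 1111 1111."
print(extract_metadata(sample))
```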

We will also spend significant time discussing basic linguistic concepts such as word senses, the various “-nym” forms (synonyms, homonyms, meronyms, etc.), lemmatization, stemming, and other areas relevant to search systems and text analysis. Note that this is an ISMT course, so while significant coding in Python is involved, the focus of the course is the intersection of technology and business requirements.

Books/Readings

Srinivasa-Desikan, Bhargav. Natural Language Processing and Computational Linguistics. Packt Publishing, 2018.
Other online readings may become available during the term.

You can obtain the book via any online reseller, or via the Harvard Coop.

Schedule

Week 1: Introduction: what are we studying?
Reading: NLP book chapter 1, skim chapter 2

  • Overview: “analytics” vs. text analytics
  • Use cases, methods, and processes
  • The problem of language analysis
  • Disciplines: social sciences, AI, data mining, CS, library science, computational linguistics
  • Course objectives and assignments
  • The “V” elements: volume, variety, velocity…and veracity
  • Quick intro to NLTK’s “book” methods
  • DUE: Assignment 0

Note: NO CLASS MEETING on 9/7 (University holiday)
Week 2: Basic Linguistics and Terms
Reading: https://en.wikipedia.org/wiki/-onym  -- drill down into at least the synonym, metonym, homonym, hypernym, and hyponym entries
https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
https://en.wikipedia.org/wiki/Lemma_(morphology)
https://en.wikipedia.org/wiki/Nomenclature
https://en.wikipedia.org/wiki/Polysemy

  • Parts of speech
  • Noun vs. verb and other “senses” of a word and their effect on analytics
  • ‘Nyms’ – homonyms, synonyms, troponyms, meronyms, and other variants
  • Stopwords, dictionaries, and taxonomies
  • Stemming, Lemmas, and other concepts
  • Dealing with contractions and other specialized forms
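To preview the stemming-versus-lemmatization distinction listed above, here is a toy sketch with a hand-made suffix list, lemma table, and stopword set (none of these come from the course tools; NLTK’s PorterStemmer and WordNetLemmatizer are the real versions). It shows why a stemmer can emit non-words while a lemmatizer maps inflected forms to dictionary headwords.

```python
STOPWORDS = {"the", "a", "an", "is", "are", "of"}

# A stemmer crudely strips suffixes, so it can produce non-words.
def naive_stem(word):
    for suffix in ("ing", "ies", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer maps inflected forms to dictionary headwords (lemmas).
LEMMAS = {"ran": "run", "geese": "goose", "better": "good"}

def naive_lemma(word):
    return LEMMAS.get(word, word)

tokens = ["the", "geese", "are", "running"]
content = [t for t in tokens if t not in STOPWORDS]  # drop stopwords
print([naive_stem(t) for t in content])   # stems: ['geese', 'runn']
print([naive_lemma(t) for t in content])  # lemmas: ['goose', 'running']
```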

Week 3: A deeper dive into language analysis

Reading: http://theconversation.com/teaching-machines-to-understand-and-summarize-text-78236
https://www.mitpressjournals.org/doi/10.1162/COLI_a_00239
http://www1.cs.columbia.edu/~sbenus/Teaching/APTD/McKee_Ch1.pdf (at least skim this)

  • Extracting meaning from text
  • How do humans parse language?
  • Understanding complexity
  • Preparing to analyze texts
  • The process of text analysis

Week 4: Asking questions of your data

  • What do we mean by “question?”
  • How to define desired outcomes
  • “Accuracy” and statistical significance
  • DUE: Assignment 1

Week 5: Step 1: tokenization
Reading: NLP book pp. 43-51, chapter 5
https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization

  • Space-based vs. NLP style tokenization
  • Various tokenizer models
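The contrast between the two tokenization styles above can be sketched in a few lines. The regex here is only a rough approximation of NLP-style tokenization (the sample sentence is invented); toolkit tokenizers in NLTK and spaCy use far more sophisticated rules and trained models.

```python
import re

text = "Dr. Smith can't wait, she's arriving at 5:30PM!"

# Space-based tokenization: punctuation stays glued to words.
space_tokens = text.split()

# Regex approximation of NLP-style tokenization: keep contractions
# intact but split punctuation into its own tokens.
nlp_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(space_tokens)
print(nlp_tokens)
```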

Note: NO CLASS MEETING on 10/12 (University Holiday)

Week 6: Step 2: Part-of-Speech (POS) tagging and Named Entity Recognition (NER)
Reading: http://www.d.umn.edu/~tpederse/Pubs/cicling2002-b.pdf
http://www.nltk.org/howto/wsd.html
https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

  • What is POS tagging?
  • Stop words and dictionaries
  • What is NER and why do we care about it?
  • Lesk Algorithm and Word Sense Disambiguation (WSD)
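The Lesk algorithm mentioned above can be sketched compactly: pick the sense whose dictionary gloss shares the most words with the surrounding context. The two-sense gloss inventory here is hand-made for illustration; NLTK’s nltk.wsd.lesk does the same against WordNet glosses.

```python
# Hand-made sense inventory (illustrative only).
SENSES = {
    "bank": {
        "financial": "an institution that accepts deposits and lends money",
        "river": "sloping land beside a body of water",
    }
}

def simple_lesk(word, context_words):
    """Return the sense of `word` whose gloss overlaps the context most."""
    context = set(context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simple_lesk("bank", "she sat on the land beside the water".split()))
# picks the 'river' sense
```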

Week 7: NLP and Project Management (note: changed topic)
Reading: NLP book, chapter 6

  • NLP projects: characteristics, planning, and other factors
  • Planning for management buy-in
  • DUE: Assignment 2
  • Midterm examination 10/31

Week 8: NLP in a nutshell
Reading: NLP book, chapter 7

  • Sentence parsing and interpretation
  • What NLP can and can’t do!
  • N-grams: individual words vs. multi-word phrases and context
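The n-gram idea above is just a sliding window of size n over the token stream, as this minimal sketch shows (toolkits expose the same operation, e.g. nltk.ngrams):

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "text analytics finds patterns in raw text".split()
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```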

Week 9: Expanding Opportunities with Linguistic Forms

  • Query Expansion using synonyms, metonyms, hypernyms
  • Narrow vs. broad queries and the use of hyponyms or hypernyms
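Query expansion as described above can be sketched with a hand-made thesaurus (the entries below are invented); a real system would pull synonyms and hypernyms from WordNet or a domain taxonomy, and would weigh the broader recall against lost precision.

```python
# Illustrative thesaurus: related terms to broaden a query with.
THESAURUS = {
    "car": {"automobile", "auto"},   # synonyms
    "dog": {"canine", "animal"},     # synonym + hypernym
}

def expand_query(terms):
    """Broaden a query by adding related terms for each query word."""
    expanded = set(terms)
    for term in terms:
        expanded |= THESAURUS.get(term, set())
    return sorted(expanded)

print(expand_query(["car", "repair"]))
```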

Week 10: Sentiment Analysis

  • What is “sentiment”?
  • How is it used in text analytics? In business?
  • Tools and techniques
  • Limitations
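A lexicon-based sentiment scorer, the simplest of the techniques this week covers, can be sketched as below. The word lists are invented placeholders; production tools such as NLTK’s VADER also weight intensity, negation, and punctuation, which is exactly where the “Limitations” discussion begins.

```python
# Tiny illustrative sentiment lexicons.
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"terrible", "hate", "broken", "bad"}

def sentiment_score(text):
    """Return a score > 0 for positive text, < 0 for negative."""
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

print(sentiment_score("I love this great product"))   # 2
print(sentiment_score("the handle arrived broken"))   # -1
```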

Week 11: Topic Modeling

  • Extracting topics from a document corpus
  • Strategies, methods, and outcomes
  • DUE: Assignment 3
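As a crude stand-in for this week’s topic-extraction material, the sketch below ranks each document’s terms by TF-IDF so that corpus-wide common words score low. The three-document corpus is invented; real topic models (e.g. LDA in gensim) infer latent topics rather than ranking keywords.

```python
import math

# Invented toy corpus for illustration.
corpus = [
    "the patient reported fever and cough",
    "the engine stalled and the brakes failed",
    "the patient received a vaccine for fever",
]

def top_terms(doc_index, k=2):
    """Return the k highest-TF-IDF terms of one document."""
    docs = [d.split() for d in corpus]
    doc = docs[doc_index]
    def tfidf(term):
        tf = doc.count(term) / len(doc)           # term frequency
        df = sum(term in d for d in docs)         # document frequency
        return tf * math.log(len(docs) / df)      # idf down-weights common words
    # Sort by score descending, breaking ties alphabetically.
    return sorted(set(doc), key=lambda t: (-tfidf(t), t))[:k]

print(top_terms(1))  # mechanical terms surface; 'the' scores zero
```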

Week 12: Data Governance and Ethics
Reading: https://blog.datasalon.com/2013/02/28/a-beginners-guide-to-data-governance/
https://www.talend.com/blog/2019/05/02/the-fundamentals-of-data-governance-part-1/
https://a.sfdcstatic.com/content/dam/www/ocms/assets/pdf/misc/data_Governance_Stewardship_ebook.pdf

  • Basic governance concepts
  • Data Ethics
  • Why governance matters to text analytics and NLP

Week 13: Ontologies, Taxonomies, and Dictionaries (pre-recorded, accessible via Zoom)

  • Taxonomy basics
  • Applications to machine learning
  • Ontologies
  • Dictionaries
  • Applications to text analysis
  • Final projects due
    • We will spend the last class period reviewing the results of the teams’ projects, assessing outcomes, and critiquing methods. Each team will spend 10-15 minutes presenting their results to the rest of the class. All students are required to participate.

Assignments

Assignment 0 (introduction to course environment)

Assignment 1

Assignment 2

Assignment 3

Final project (note: teams will be identified during the term) – a 10-12-page paper describing the results of an initial analysis of a small-to-medium-sized data set, with recommendations for further work (details to be provided later)

Note: ALL assignments are to be submitted via Canvas. Do not email completed assignments to the instructor as they will not be counted or returned.

Prerequisites

Students enrolling in this course must have:

  • A background in programming, preferably in Python or a closely related object-oriented language
  • An understanding of how to use a text editor or an IDE such as Eclipse, as opposed to a word processor such as MS Word or OpenOffice Writer
  • Important! Prior experience writing code in a Linux environment via command line tools is required if you do not have a working Python installation of your own
  • Some background in linguistics or statistics may be helpful, but is not required

This course teaches the following areas:

  • Linguistic concepts that are directly applicable to text analytics
  • Practical approaches and concepts regarding text analytics using NLTK (the Natural Language Toolkit) and spaCy, plus other related tool sets
  • “Business” uses and cases for the use of text analytics and NLP

This course does NOT teach:

  • How to use a text editor
  • The Python language
  • File system concepts
  • Operating systems (e.g. Linux, Windows) or how to install/configure any of the tools used in the course

Attendance: Regular attendance of live lecture sessions is strongly encouraged so that students can ask questions and take part in discussions. Recordings will be made available weekly, and students are also expected to interact regularly in Canvas (for example, by asking questions, discussing breakthroughs, or raising topics that need additional detail). Course staff will monitor discussions and assist as needed. You are also encouraged to post links to interesting new tech related to the course, case studies, or other material others will find relevant. Participate! Have fun!

All work is to be performed using the systems provided by the Extension School; students may opt to install Python, NLTK, and other prerequisites on their own systems but will not receive any assistance from course staff and must submit assignments using Canvas. Note that while you can use Jupyter notebooks when creating code, you cannot submit them since the Canvas grading system is incapable of parsing that file format.

Important note: plagiarism of any kind will be dealt with in the strongest possible manner. Please see the student guide to academic honesty for details. https://handbook.fas.harvard.edu/book/academic-integrity

You are responsible for understanding Harvard Extension School policies on academic integrity (https://www.extension.harvard.edu/resources-policies/student-conduct/academic-integrity) and how to use sources responsibly. Not knowing the rules, misunderstanding the rules, running out of time, submitting the wrong draft, or being overwhelmed with multiple demands are not acceptable excuses. There are no excuses for failure to uphold academic integrity. To support your learning about academic citation rules, please visit the Harvard Extension School Tips to Avoid Plagiarism (https://www.extension.harvard.edu/resources-policies/resources/tips-avoid-plagiarism), where you'll find links to the Harvard Guide to Using Sources and two free online 15-minute tutorials to test your knowledge of academic citation policy. The tutorials are anonymous open-learning tools.

Accessibility: 

The Extension School is committed to providing an accessible academic community. The Accessibility Office offers a variety of accommodations and services to students with documented disabilities. Please visit https://www.extension.harvard.edu/resources-policies/resources/disability-services-accessibility for more information.
