Course Syllabus

ISMT E-117 (#16099)
Text Analytics and Natural Language Processing

Instructor: Richard Joltes, ALM in IT, Harvard University Division of Continuing Education

Meetings: Online via Zoom, starting Monday 8/31/2020 5:30 – 7:30PM

Introduction

The extraction of relevant information from a mass of raw, unstructured text can provide a cornucopia of useful insights, which can then be used to drive business decisions in a variety of contexts. If, for example, a manufacturer performs analytics on Voice of Customer (VOC) texts extracted from vendor sites or its own customer service records, it may identify a product defect or service issue quickly enough to react before it becomes a newsworthy (or worse, litigation-producing) event.

From another perspective, if patient information systems aggregate records and perform textual analysis on physician notes or diagnoses, an impending disease outbreak may be identified far enough in advance to improve the medical community’s response. The opportunities for leveraging textual information are endless, yet many organizations are ill-prepared and ill-equipped to handle both the volume and the variety of such data.

Analysis of free-form text is messy and difficult, since language is fluid and usage often varies from one region to another. However, many tools, both free and commercial, are available from a variety of sources. Some are complete, out-of-the-box solutions, while others are toolkits requiring some level of programming experience to implement in a given context.

This course introduces students to the tools, techniques, and opportunities for performing text analytics in a variety of contexts. We examine tools such as NLTK, spaCy, and the WordNet dictionary along with fully featured applications such as IBM’s Watson Explorer analytics platform. The new BERT model will be explored if time allows. Significant discussion will also be devoted to organizational aspects, such as governance, data integrity, and the process of identifying/processing a body of texts in order to address a business need.

Course work involves using the selected tools to analyze groups of texts for insights such as:

  • Sentiment – how a consumer or client feels about a product or experience
  • Metadata – can we reliably identify phone numbers, credit card numbers, model numbers, or other specific elements?
  • Named Entity Recognition – is it possible to reliably extract personal names, locations, and other specific entities from raw text?
  • Parts of Speech (POS) – identifying the usage of words in context, in order to perform topic analysis or perform other Natural Language Processing (NLP) functions against a body of texts
  • Word Senses – determining how a word is used in a specific context
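As a small preview of the metadata task above, the sketch below uses Python’s standard re module to flag candidate phone and credit-card numbers. The patterns and sample text are invented for illustration; real extraction pipelines need far broader patterns plus validation (e.g. Luhn checksums for card numbers).

```python
import re

# Deliberately naive patterns for two common metadata types.
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d{4}[- ]){3}\d{4}\b")

def extract_metadata(text):
    """Return candidate phone and credit-card numbers found in text."""
    return {
        "phones": PHONE_RE.findall(text),
        "cards": CARD_RE.findall(text),
    }

sample = "Call 617-555-0123 about card 4111 1111 1111 1111."
print(extract_metadata(sample))
```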

We will also spend significant time discussing basic linguistic concepts such as word senses, the various “-nym” forms (synonyms, homonyms, meronyms, etc.), lemmatization, stemming, and other areas relevant to search systems and text analysis. Note that this is an ISMT course, so while significant coding in Python is involved, the focus of the course is the intersection of technology and business requirements.

Books/Readings

Srinivasa-Desikan, Bhargav. Natural Language Processing and Computational Linguistics. Packt Publishing, 2018.
Other online readings may become available during the term.

You can obtain the book via any online reseller, or via the Harvard Coop.

Schedule

Week 1: Introduction: what are we studying?
Reading: NLP book chapter 1, skim chapter 2

  • Overview: “analytics” vs. text analytics
  • Use cases, methods, and processes
  • The problem of language analysis
  • Disciplines: social sciences, AI, data mining, CS, library science, computational linguistics
  • Course objectives and assignments
  • The “V” elements: volume, variety, velocity…and veracity
  • Quick intro to NLTK’s “book” methods
  • DUE: Assignment 0

Note: NO CLASS MEETING on 9/7 (University holiday)
Week 2: Basic Linguistics and Terms
Reading: https://en.wikipedia.org/wiki/-onym  -- drill down into at least the synonym, metonym, homonym, hypernym, and hyponym entries
https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
https://en.wikipedia.org/wiki/Lemma_(morphology)
https://en.wikipedia.org/wiki/Nomenclature
https://en.wikipedia.org/wiki/Polysemy

  • Parts of speech
  • Noun vs. verb and other “senses” of a word and their effect on analytics
  • ‘Nyms’ – homonyms, synonyms, troponyms, meronyms, and other variants
  • Stopwords, dictionaries, and taxonomies
  • Stemming, Lemmas, and other concepts
  • Dealing with contractions and other specialized forms
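To preview the stemming-versus-lemmatization distinction listed above, here is a toy sketch with a hand-made suffix list, lemma table, and stopword set (none of these come from the course tools; NLTK’s PorterStemmer and WordNetLemmatizer are the real versions). It shows why a stemmer can emit non-words while a lemmatizer maps inflected forms to dictionary headwords.

```python
STOPWORDS = {"the", "a", "an", "is", "are", "of"}

# A stemmer crudely strips suffixes, so it can produce non-words.
def naive_stem(word):
    for suffix in ("ing", "ies", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer maps inflected forms to dictionary headwords (lemmas).
LEMMAS = {"ran": "run", "geese": "goose", "better": "good"}

def naive_lemma(word):
    return LEMMAS.get(word, word)

tokens = ["the", "geese", "are", "running"]
content = [t for t in tokens if t not in STOPWORDS]  # drop stopwords
print([naive_stem(t) for t in content])   # stems: ['geese', 'runn']
print([naive_lemma(t) for t in content])  # lemmas: ['goose', 'running']
```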

Week 3: A deeper dive into language analysis

Reading: http://theconversation.com/teaching-machines-to-understand-and-summarize-text-78236
https://www.mitpressjournals.org/doi/10.1162/COLI_a_00239
http://www1.cs.columbia.edu/~sbenus/Teaching/APTD/McKee_Ch1.pdf (at least skim this)

  • Extracting meaning from text
  • How do humans parse language?
  • Understanding complexity
  • Preparing to analyze texts
  • The process of text analysis

Week 4: Asking questions of your data

  • What do we mean by “question?”
  • How to define desired outcomes
  • “Accuracy” and statistical significance
  • DUE: Assignment 1

Week 5: Step 1: tokenization
Reading: NLP book pp. 43-51, chapter 5
https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization

  • Space-based vs. NLP style tokenization
  • Various tokenizer models
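The contrast between the two tokenization styles above can be sketched in a few lines. The regex here is only a rough approximation of NLP-style tokenization (the sample sentence is invented); toolkit tokenizers in NLTK and spaCy use far more sophisticated rules and trained models.

```python
import re

text = "Dr. Smith can't wait, she's arriving at 5:30PM!"

# Space-based tokenization: punctuation stays glued to words.
space_tokens = text.split()

# Regex approximation of NLP-style tokenization: keep contractions
# intact but split punctuation into its own tokens.
nlp_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(space_tokens)
print(nlp_tokens)
```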

Note: NO CLASS MEETING on 10/12 (University Holiday)

Week 6: Step 2: Part-of-Speech (POS) tagging and Named Entity Recognition (NER)
Reading: http://www.d.umn.edu/~tpederse/Pubs/cicling2002-b.pdf
http://www.nltk.org/howto/wsd.html
https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

  • What is POS tagging?
  • Stop words and dictionaries
  • What is NER and why do we care about it?
  • Lesk Algorithm and Word Sense Disambiguation (WSD)
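The Lesk algorithm mentioned above can be sketched compactly: pick the sense whose dictionary gloss shares the most words with the surrounding context. The two-sense gloss inventory here is hand-made for illustration; NLTK’s nltk.wsd.lesk does the same against WordNet glosses.

```python
# Hand-made sense inventory (illustrative only).
SENSES = {
    "bank": {
        "financial": "an institution that accepts deposits and lends money",
        "river": "sloping land beside a body of water",
    }
}

def simple_lesk(word, context_words):
    """Return the sense of `word` whose gloss overlaps the context most."""
    context = set(context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simple_lesk("bank", "she sat on the land beside the water".split()))
# picks the 'river' sense
```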

Week 7: NLP and Project Management (note: changed topic)
Reading: NLP book, chapter 6

  • NLP projects: characteristics, planning, and other factors
  • Planning for management buy-in
  • DUE: Assignment 2
  • Midterm examination 10/31

Week 8: NLP in a nutshell
Reading: NLP book, chapter 7

  • Sentence parsing and interpretation
  • What NLP can and can’t do!
  • N-grams: individual words vs. multi-word phrases and context
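The n-gram idea above is just a sliding window of size n over the token stream, as this minimal sketch shows (toolkits expose the same operation, e.g. nltk.ngrams):

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "text analytics finds patterns in raw text".split()
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```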

Week 9: Expanding Opportunities with Linguistic Forms

  • Query Expansion using synonyms, metonyms, hypernyms
  • Narrow vs. broad queries and the use of hyponyms or hypernyms
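Query expansion as described above can be sketched with a hand-made thesaurus (the entries below are invented); a real system would pull synonyms and hypernyms from WordNet or a domain taxonomy, and would weigh the broader recall against lost precision.

```python
# Illustrative thesaurus: related terms to broaden a query with.
THESAURUS = {
    "car": {"automobile", "auto"},   # synonyms
    "dog": {"canine", "animal"},     # synonym + hypernym
}

def expand_query(terms):
    """Broaden a query by adding related terms for each query word."""
    expanded = set(terms)
    for term in terms:
        expanded |= THESAURUS.get(term, set())
    return sorted(expanded)

print(expand_query(["car", "repair"]))
```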

Week 10: Sentiment Analysis

  • What is “sentiment”?
  • How is it used in text analytics? In business?
  • Tools and techniques
  • Limitations
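A lexicon-based sentiment scorer, the simplest of the techniques this week covers, can be sketched as below. The word lists are invented placeholders; production tools such as NLTK’s VADER also weight intensity, negation, and punctuation, which is exactly where the “Limitations” discussion begins.

```python
# Tiny illustrative sentiment lexicons.
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"terrible", "hate", "broken", "bad"}

def sentiment_score(text):
    """Return a score > 0 for positive text, < 0 for negative."""
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

print(sentiment_score("I love this great product"))   # 2
print(sentiment_score("the handle arrived broken"))   # -1
```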

Week 11: Topic Modeling

  • Extracting topics from a document corpus
  • Strategies, methods, and outcomes
  • DUE: Assignment 3
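As a crude stand-in for this week’s topic-extraction material, the sketch below ranks each document’s terms by TF-IDF so that corpus-wide common words score low. The three-document corpus is invented; real topic models (e.g. LDA in gensim) infer latent topics rather than ranking keywords.

```python
import math

# Invented toy corpus for illustration.
corpus = [
    "the patient reported fever and cough",
    "the engine stalled and the brakes failed",
    "the patient received a vaccine for fever",
]

def top_terms(doc_index, k=2):
    """Return the k highest-TF-IDF terms of one document."""
    docs = [d.split() for d in corpus]
    doc = docs[doc_index]
    def tfidf(term):
        tf = doc.count(term) / len(doc)           # term frequency
        df = sum(term in d for d in docs)         # document frequency
        return tf * math.log(len(docs) / df)      # idf down-weights common words
    # Sort by score descending, breaking ties alphabetically.
    return sorted(set(doc), key=lambda t: (-tfidf(t), t))[:k]

print(top_terms(1))  # mechanical terms surface; 'the' scores zero
```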

Week 12: Data Governance and Ethics
Reading: https://blog.datasalon.com/2013/02/28/a-beginners-guide-to-data-governance/
https://www.talend.com/blog/2019/05/02/the-fundamentals-of-data-governance-part-1/
https://a.sfdcstatic.com/content/dam/www/ocms/assets/pdf/misc/data_Governance_Stewardship_ebook.pdf

  • Basic governance concepts
  • Data Ethics
  • Why governance matters to text analytics and NLP

Week 13: Ontologies, Taxonomies, and Dictionaries (pre-recorded, accessible via Zoom)

  • Taxonomy basics
  • Applications to machine learning
  • Ontologies
  • Dictionaries
  • Applications to text analysis
  • Final projects due
    • We will spend the last class period reviewing the results of the teams’ projects, assessing outcomes, and critiquing methods. Each team will spend 10-15 minutes presenting their results to the rest of the class. All students are required to participate.

Assignments

Assignment 0 (introduction to course environment)

Assignment 1

Assignment 2

Assignment 3

Final project (note: teams will be identified during the term) – a 10-12-page paper describing the results of an initial analysis of a small-to-medium-sized data set, with recommendations for further work (details to be provided later)

Note: ALL assignments are to be submitted via Canvas. Do not email completed assignments to the instructor as they will not be counted or returned.

Prerequisites

Students enrolling in this course must have:

  • A background in programming, preferably in Python or a closely related object-oriented language
  • An understanding of how to use a text editor or an IDE such as Eclipse, as opposed to a word processor such as MS Word or OpenOffice Writer
  • Important! Prior experience writing code in a Linux environment via command line tools is required if you do not have a working Python installation of your own
  • Some background in linguistics or statistics may be helpful, but is not required

This course teaches the following areas:

  • Linguistic concepts that are directly applicable to text analytics
  • Practical approaches and concepts regarding text analytics using NLTK (the Natural Language Toolkit) and spaCy, plus other related tool sets
  • “Business” uses and cases for the use of text analytics and NLP

This course does NOT teach:

  • How to use a text editor
  • The Python language
  • File system concepts
  • Operating systems (e.g. Linux, Windows) or how to install/configure any of the tools used in the course

Attendance: Regular attendance of live lecture sessions is strongly encouraged so that students can ask questions and take part in discussions. Recordings will be made available weekly, and students are also expected to interact regularly in Canvas (for example, by asking questions, discussing breakthroughs, or raising topics that need additional detail). Course staff will monitor discussions and assist as needed. You are also encouraged to post links to interesting new tech related to the course, case studies, or other material others will find relevant. Participate! Have fun!

All work is to be performed using the systems provided by the Extension School; students may opt to install Python, NLTK, and other prerequisites on their own systems but will not receive any assistance from course staff and must submit assignments using Canvas. Note that while you can use Jupyter notebooks when creating code, you cannot submit them since the Canvas grading system is incapable of parsing that file format.

Important note: plagiarism of any kind will be dealt with in the strongest possible manner. Please see the student guide to academic honesty for details. https://handbook.fas.harvard.edu/book/academic-integrity

You are responsible for understanding Harvard Extension School policies on academic integrity (https://www.extension.harvard.edu/resources-policies/student-conduct/academic-integrity) and how to use sources responsibly. Not knowing the rules, misunderstanding the rules, running out of time, submitting the wrong draft, or being overwhelmed with multiple demands are not acceptable excuses. There are no excuses for failure to uphold academic integrity. To support your learning about academic citation rules, please visit the Harvard Extension School Tips to Avoid Plagiarism (https://www.extension.harvard.edu/resources-policies/resources/tips-avoid-plagiarism), where you'll find links to the Harvard Guide to Using Sources and two free online 15-minute tutorials to test your knowledge of academic citation policy. The tutorials are anonymous open-learning tools.

Accessibility: 

The Extension School is committed to providing an accessible academic community. The Accessibility Office offers a variety of accommodations and services to students with documented disabilities. Please visit https://www.extension.harvard.edu/resources-policies/resources/disability-services-accessibility for more information.
