Course Syllabus
ISMT E-117 (#16099)
Instructor: Richard Joltes, ALM in IT, Harvard University Division of Continuing Education
Meetings: Online via Zoom, starting Monday 8/31/2020, 5:30 – 7:30 PM

Introduction

The extraction of relevant information from a mass of raw, unstructured text can provide a cornucopia of useful insights, which can then be used to drive business decisions in a variety of contexts. If, for example, a manufacturer performs analytics on Voice Of Customer (VOC) texts extracted from vendor sites or their own customer service records, they may identify a product defect or service issue quickly enough to react before it becomes a newsworthy (or worse, litigation-producing) event. From another perspective, if patient information systems aggregate records and perform textual analysis on physician notes or diagnoses, an impending disease outbreak may be identified sufficiently in advance to help improve the medical community’s response.

The opportunities for leveraging textual information are endless, yet many organizations are ill-prepared or ill-equipped to handle both the volume and the variety. Analysis of free-form text is messy and difficult since language is fluid and usages often vary from one region to another. However, many tools, both free and commercial, are available from a variety of sources. Some are complete, out-of-box solutions while others are toolkits requiring some level of programming experience to implement in a given context.

This course introduces students to the tools, techniques, and opportunities for performing text analytics in a variety of contexts. We examine tools such as NLTK, spaCy, and the WordNet dictionary along with fully featured applications such as IBM’s Watson Explorer analytics platform. The new BERT model will be explored if time allows.
Significant discussion will also be devoted to organizational aspects, such as governance, data integrity, and the process of identifying/processing a body of texts in order to address a business need. Course work involves using the selected tools to analyze groups of texts for insights such as:
We will also spend significant time discussing basic linguistic concepts such as word senses, the various “-nym” forms (synonyms, homonyms, meronyms, etc.), lemmatization, stemming, and other areas relevant to search systems and text analysis. Note that this is an ISMT course, so while significant coding in Python is involved, the focus of the course is the junction of technology and business requirements.

Books/Readings

Srinivasa-Desikan, Bhargav, Natural Language Processing and Computational Linguistics, Packt 2018

You can obtain the book via any online reseller, or via the Harvard Coop.

Schedule

Week 1: Introduction: what are we studying?
Note: NO CLASS MEETING on 9/7 (University holiday)
Week 3: A deeper dive into language analysis
Reading: http://theconversation.com/teaching-machines-to-understand-and-summarize-text-78236
Week 4: Asking questions of your data
Week 5: Step 1: tokenization
Note: NO CLASS MEETING on 10/12 (University holiday)
Week 6: Step 2: Part-of-Speech (POS) tagging and Named Entity Recognition (NER)
https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
Week 7: NLP and Project Management (note: changed topic)
Week 8: NLP in a nutshell
Week 9: Expanding Opportunities with Linguistic forms
Week 10: Sentiment Analysis
Week 11: Topic Modeling
Week 12: Data Governance and Ethics
Week 13: Ontologies, Taxonomies, and Dictionaries (pre-recorded, accessible via Zoom)

Assignments

Assignment 0 (introduction to course environment)
Assignment 1
Assignment 2
Assignment 3
Final project (note: teams will be identified during the term): a 10- to 12-page paper describing the results of the initial analysis of a small- to medium-sized data set, with recommendations for further work (details to be provided later)

Note: ALL assignments are to be submitted via Canvas. Do not email completed assignments to the instructor, as they will not be counted or returned.

Prerequisites

Students enrolling in this course must have:
This course teaches the following areas:
This course does NOT teach:
Attendance: Regular attendance at live lecture sessions is strongly encouraged so that you can ask questions and take part in discussions. Recordings will be made available weekly, and students are also expected to interact regularly in Canvas (asking questions, discussing breakthroughs, and/or bringing up topics for which they need additional detail, for example). Course staff will monitor discussions and assist as needed. You are also encouraged to post links to interesting new tech related to the course, case studies, or other material others will find relevant. Participate! Have fun!

All work is to be performed using the systems provided by the Extension School; students may opt to install Python, NLTK, and other prerequisites on their own systems but will not receive any assistance from course staff and must submit assignments using Canvas. Note that while you can use Jupyter notebooks when creating code, you cannot submit them since the Canvas grading system is incapable of parsing that file format.

Important note: plagiarism of any kind will be dealt with in the strongest possible manner. Please see the student guide to academic honesty for details: https://handbook.fas.harvard.edu/book/academic-integrity

You are responsible for understanding Harvard Extension School policies on academic integrity (https://www.extension.harvard.edu/resources-policies/student-conduct/academic-integrity) and how to use sources responsibly. Not knowing the rules, misunderstanding the rules, running out of time, submitting the wrong draft, or being overwhelmed with multiple demands are not acceptable excuses. There are no excuses for failure to uphold academic integrity.
To support your learning about academic citation rules, please visit the Harvard Extension School Tips to Avoid Plagiarism (https://www.extension.harvard.edu/resources-policies/resources/tips-avoid-plagiarism), where you'll find links to the Harvard Guide to Using Sources and two free online 15-minute tutorials to test your knowledge of academic citation policy. The tutorials are anonymous open-learning tools.

Accessibility: The Extension School is committed to providing an accessible academic community. The Accessibility Office offers a variety of accommodations and services to students with documented disabilities. Please visit https://www.extension.harvard.edu/resources-policies/resources/disability-services-accessibility for more information.