Web Scraping & Text Mining

Web Scraping & Data Mining, EUI, Florence (2018)


Course Description

This course introduces students to the data science fundamentals of extracting, processing and classifying web content. It reviews current methods for automated web scraping, natural language processing for parsing unstructured data, and machine learning algorithms for textual data. The first part of the course provides an in-depth survey of the structures and features of web content (XML, JSON, HTML, CSS selectors and XPath) and covers the main tools for harvesting data from static and dynamic web pages and APIs and processing it into structured formats. The second part explores applications of machine learning algorithms to the parsed data, with a particular focus on text analysis. Under the umbrella of supervised and unsupervised learning, the course covers traditional approaches to content analysis and dictionary-based methods, machine learning algorithms for classification, scaling methods and topic modeling. Our goal is to help students automate the extraction of online content, parse the unstructured data into formats amenable to analysis, and produce quantities of interest using classification and data reduction methods, for the most part with text as data. The course will be taught in R, but we may also touch upon Python libraries for particular applications.

Links to Readings and Data

Course Structure

Day 1 - Web Scraping I:
- slides, R scripts
Day 2 - Web Scraping II and OCR:
- slides, R scripts
Day 3 - Working with Text as Data:
- slides, R scripts
Day 4 - Supervised Learning:
- slides, R scripts
Day 5 - Unsupervised Learning:
- slides, R scripts
Assessment:
- instructions, R script