Learning | Data

Our take on the ideas, information, and tools that make data work.

Tara Donovan exhibition.

Solving the right problem

Max Shron and Sasha Laundy explore tactics for need-finding and problem scoping that make it possible to put investments in data to profitable use.

William Caxton showing specimens of his printing to King Edward IV and his Queen.

Easy, reproducible reports with R

Garrett Grolemund demonstrates how to use R Markdown to combine code and text into a single .Rmd file to generate polished reports automatically in a variety of formats.

Frank Gehry's Dancing House windows.

Best practices for streaming applications

Mark Grover and Ted Malaska offer an overview of projects for streaming applications, including Kafka, Flume, and Spark Streaming, and discuss the available architectural patterns, such as Lambda and Kappa.

The color frontispiece from Albert Henry Munsell's 1905 pamphlet "A Color Notation."

Running Spark on Alluxio with S3

Calvin Jia presents an in-depth overview of Alluxio and its role in the big data ecosystem. In this segment, he reviews examples that show how Alluxio complements Spark and S3 to enable fast data access.

Herding the crowd.

Organizing big data with the crowd

Using real-world cases, Lukas Biewald describes microtasking, where it fits in the crowdsourcing landscape, and how data scientists and developers can tap into the crowd to collect and process data sets.

Ornamental bars

Securing Apache Kafka

Jun Rao explains the threats that Kafka Security mitigates, the changes that were made to Kafka to enable security, and the steps required to secure an existing Kafka cluster.


What is a resilient distributed dataset?

Alex Robbins guides you through an in-depth look at the Python API for Apache Spark. In this segment, he explores RDDs, the central abstraction in Spark and essential knowledge for anyone working in the system.

Expedition 47 Commander Tim Kopra of NASA captured this brightly lit night image of the city of Chicago on April 5, 2016, from the International Space Station.

Dive into scikit-learn

With scikit-learn, you can deploy machine learning models in just a few lines of code. Andreas Mueller summarizes the classification, regression, and clustering algorithms in this powerful machine learning library.

"Preparation for WAR to defend Commerce," Birch's Views of Philadelphia.

Building data science teams: Preparing your organization

How should you prepare when assembling and integrating a data science team into your organization? In this video training segment, Paco Nathan offers tips to consider in the early stages, including designating the right executive sponsor and encouraging basic hands-on data science training for management.

The bridge over the Crim Dell at the College of William and Mary

Architecting Hadoop Applications

In this O'Reilly training video, the "Hadoop Application Architectures" authors present an end-to-end case study of a clickstream analytics engine to provide a concrete example of how to architect and implement a complete solution with Hadoop. In this segment, they provide an overview of the complete architecture.

Presenters: Mark Grover, Gwen Shapira, Jonathan Seidman, Ted Malaska