Real-World Machine Learning

This is an excerpt from Manning's book Real-World Machine Learning.

Feature engineering enables you to use unstructured data sources in ML models. Many data sources aren’t inherently structured into feature vectors that can be directly inserted into the ML framework presented in the first four chapters. Unstructured data such as text, time series, images, video, log data, and clickstreams account for the vast majority of data that’s created. Feature engineering is what enables ML practitioners to produce ML feature vectors out of these kinds of raw data streams.

Many datasets that are amassed by modern data-collection systems come in the form of time series, measurements of a process or set of processes across time. Time-series data is valuable because it provides a window into the time-varying characteristics of the subjects at hand and enables ML practitioners to move beyond employing static snapshots of these subjects to make predictions. But fully extracting the value out of time-series data can be difficult. This section describes two common types of time-series data—classical time series and point processes (event data)—and details some of the most widely used time-series features.
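To make the distinction concrete, here is a minimal sketch using made-up data (the variable names and numbers are assumptions, not the book's): a classical time series is a set of regularly spaced measurements, whereas a point process is a set of raw event timestamps, which is often converted into a classical series by counting events in fixed time bins.

```python
import numpy as np
import pandas as pd

# Classical time series: regularly spaced measurements, e.g. monthly counts (synthetic data).
months = pd.date_range("2010-01-01", periods=24, freq="MS")
monthly_counts = pd.Series(np.random.poisson(lam=100, size=len(months)), index=months)

# Point process (event data): irregular event timestamps, e.g. individual incident reports.
start = pd.Timestamp("2010-01-01")
event_times = start + pd.to_timedelta(np.sort(np.random.uniform(0, 730, size=500)), unit="D")

# A common first step with event data is to turn it into a classical time series
# by counting events in fixed-width time bins (here, one month).
event_counts = pd.Series(1, index=event_times).resample("MS").sum()
```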

  • Average— The mean or median of the measurements can uncover tendencies in the average value of a time series.
Next, you move to more-sophisticated classical time-series features. Autocorrelation features measure the statistical correlation of a time series with a lagged version of itself. For example, the lag-1 autocorrelation feature takes the original time series and correlates it with the same series shifted one time bin to the left (with the nonoverlapping portions removed). Shifting the time series like this captures the presence of periodicity and other statistical structure. The shape of the autocorrelation function (the autocorrelation computed over a grid of time lags) captures the essence of the structure of the time series. In Python, the statsmodels module contains an easy-to-use autocorrelation function. Figure 7.7 shows how the autocorrelation is computed and plots the autocorrelation function for the SF crime data.

Figure 7.7. Top: Correlation of the original time series and 12-month lagged time series defines the 12-month autocorrelation. Bottom: The autocorrelation function for the SF crime data. The autocorrelation is high for short time scales, showing high dependence of any month’s crime on the previous months’ values.
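A minimal sketch of how such autocorrelation features could be computed with the statsmodels autocorrelation function, using synthetic monthly counts as a stand-in for the SF crime data (the variable names and lag choices are assumptions, not the book's code):

```python
import numpy as np
from statsmodels.tsa.stattools import acf

# Synthetic monthly counts with an annual cycle, standing in for the SF crime data.
rng = np.random.default_rng(0)
months = np.arange(120)
counts = 100 + 10 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 3, size=months.size)

# Autocorrelation function over lags 0 through 24 months.
autocorr = acf(counts, nlags=24, fft=True)

# Candidate ML features: the autocorrelation at selected lags.
lag1_feature = autocorr[1]    # correlation with the series shifted by one month
lag12_feature = autocorr[12]  # correlation with the series shifted by one year
```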

Fourier analysis is one of the most commonly used tools for time-series feature engineering. The goal of Fourier analysis is to decompose a time series into a sum of sine and cosine functions across a range of frequencies; such periodic components occur naturally in many real-world datasets. Performing this decomposition enables you to quickly identify periodic structure in the time series. The Fourier decomposition is achieved by using the discrete Fourier transform, which computes the spectral density of the time series—how well it correlates with a sinusoid at each given frequency—as a function of frequency. The resulting decomposition of a time series into its component spectral densities is called a periodogram. Figure 7.8 shows the periodogram of the San Francisco crime data, computed using the scipy.signal.periodogram function (several Python modules have methods for periodogram estimation). From the periodogram, various ML features can be computed, such as the spectral density at specified frequencies, the sum of the spectral densities within frequency bands, or the location of the highest spectral density (which describes the fundamental frequency of oscillation of the time series). The following listing provides example code for periodogram computation and features.
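A minimal sketch along these lines, using synthetic monthly counts in place of the SF crime data (the variable names, the sampling rate fs=12, and the chosen frequency band are assumptions, not the book's listing):

```python
import numpy as np
from scipy.signal import periodogram

# Synthetic monthly counts standing in for the SF crime series; fs=12 samples per year.
rng = np.random.default_rng(1)
t = np.arange(120)
counts = 100 + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 3, size=t.size)

# Spectral density as a function of frequency (in cycles per year, because fs=12).
freqs, spec = periodogram(counts, fs=12)

# Example features derived from the periodogram:
peak_freq = freqs[np.argmax(spec)]                         # fundamental frequency of oscillation
annual_band = spec[(freqs >= 0.5) & (freqs <= 1.5)].sum()  # summed density near one cycle per year
```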
