Machine learning is the process of building analytical models that automatically discover previously unknown patterns in data, such as associations, sequences, anomalies (outliers), classifications, and clusters or segments. These patterns reveal hidden rules about why an event happened—for example, rules that predict which customers are likely to churn. Businesses can apply machine learning in several ways:
- Segmentation, or grouping sets of customers who have similar buying patterns for targeted marketing
- Classification based on a set of attributes to make a prediction—for example, propensity to buy, insurance policies likely to lapse, or equipment failure that triggers preventive maintenance
- Forecasts—for example, sales projections based on time series
- Pattern discovery, which associates one product with another to reveal cross-sell opportunities, and sequence analysis—for example, products that sell together over time
- Anomaly detection—for example, detecting fraud
Predictive analytics model methodology
Predictive analytical models are typically developed using the widely adopted Cross Industry Standard Process for Data Mining (CRISP-DM) methodology. CRISP-DM includes six phases: business understanding, data understanding, data preparation, model development (using supervised or unsupervised learning), model evaluation and model deployment.
The business understanding phase involves defining the business problem or use case, the business objectives and the business questions that need to be answered. It also involves defining success criteria. The standard project tasks then follow: defining resource requirements such as people, budget and technology; creating a project plan; identifying constraints; assessing risks; and creating a contingency plan.
The data understanding phase involves identifying data requirements, including internal and external data sources and data characteristics such as volume, variety, velocity and format, as well as whether the data resides in flat files, a relational database or a Hadoop Distributed File System (HDFS), or arrives as live streaming data.
This phase also includes data exploration using statistical analysis to look at the data—for example, basic statistics about each data column and any indication that the data is skewed. Visualizations such as histograms and scatterplots help with drilling down on outliers and errors. In addition, a data quality assessment involves understanding the degree to which data is missing, contains errors, is inconsistent or is duplicated.
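The column-level statistics and outlier checks described above can be sketched in a few lines of plain Python. This is a minimal illustration on a hypothetical "age" column (in practice, tools such as pandas or a dedicated profiling tool would do this at scale); the values and the two-standard-deviation outlier rule are illustrative assumptions.

```python
import statistics

# Toy "age" column with a missing value (None) and a suspicious outlier.
ages = [34, 29, 41, None, 38, 27, 35, 120, 33, 30]

present = [v for v in ages if v is not None]
missing = len(ages) - len(present)  # data quality: how much is missing?

mean = statistics.mean(present)
median = statistics.median(present)
stdev = statistics.stdev(present)

# A mean well above the median suggests right-skewed data or outliers.
print(f"missing={missing} mean={mean:.1f} median={median} stdev={stdev:.1f}")

# Flag values more than two standard deviations from the mean.
outliers = [v for v in present if abs(v - mean) > 2 * stdev]
print("possible outliers:", outliers)
```

Here the mean (43.0) sits well above the median (34), which is exactly the kind of skew signal that a histogram would then help investigate.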
The objective of the data preparation phase is to produce a set of data that can be fed into machine-learning algorithms. This process requires a number of tasks, including data enrichment, filtering and cleaning; data conversion; data transformation; and variable identification, which is also known as feature selection or dimensionality reduction. The objective of variable identification is to create a data set of the most highly relevant variables to be used as model input, and to remove variables that are not useful as input, without compromising the model's accuracy—that is, the accuracy of the predictions it makes.
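One simple variable-identification technique is to drop variables with little or no variance, since a near-constant column cannot help discriminate between outcomes. The sketch below is a minimal pure-Python illustration of that idea (libraries such as scikit-learn provide this as `VarianceThreshold`); the variable names, values and threshold are hypothetical.

```python
# Toy data set: each key is a candidate input variable (feature).
data = {
    "age":        [25, 37, 45, 29, 52],
    "country":    [1, 1, 1, 1, 1],      # constant, so it carries no signal
    "purchases":  [3, 8, 1, 5, 9],
    "newsletter": [0, 1, 0, 0, 1],
}

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Keep only variables whose variance clears a (hypothetical) threshold.
THRESHOLD = 0.05
selected = [name for name, col in data.items() if variance(col) >= THRESHOLD]
print("selected variables:", selected)
```

Real feature selection also considers relevance to the target and redundancy between variables, not just variance, but the goal is the same: a smaller input set that preserves predictive accuracy.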
The model development phase is about the development of a machine-learning model. Models can be built to predict, forecast or analyze data to find patterns such as associations and groups. Two types of machine learning can be used in model development: supervised learning and unsupervised learning.
Typically, predictive models are built using supervised learning. For example, to develop a model that predicts equipment failure, we can use historical data that describes equipment that has actually failed, along with equipment that has not. We can use that data to train the model to recognize the profile of a piece of equipment that is likely to fail. To accomplish this profile recognition, we split the labeled data set into a training data set and a test data set. We then train the model by feeding the training data set into an algorithm (several algorithms are suitable for prediction) and test the model using the test data set.
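The train/test workflow above can be sketched as follows. This is a deliberately simple pure-Python stand-in: the sensor readings are invented, and the "algorithm" is a nearest-centroid classifier rather than anything a real project would choose (where a library such as scikit-learn and a stronger model would be used).

```python
import random

# Toy labeled records: (temperature, vibration) readings for equipment,
# with label 1 = failed, 0 = healthy. All values are hypothetical.
records = [
    ((30, 2), 0), ((35, 3), 0), ((32, 2), 0), ((31, 1), 0),
    ((80, 9), 1), ((85, 8), 1), ((78, 9), 1), ((82, 7), 1),
]

random.seed(0)                      # deterministic shuffle for the sketch
random.shuffle(records)
split = int(len(records) * 0.75)    # 75% training, 25% test
train, test = records[:split], records[split:]

def centroid(points):
    # Mean feature vector of a list of points.
    return tuple(sum(vals) / len(vals) for vals in zip(*points))

# "Training": learn one centroid per class from the training set only.
centroids = {
    label: centroid([x for x, y in train if y == label])
    for label in {y for _, y in train}
}

def predict(x):
    # Assign the class whose centroid is nearest (squared Euclidean distance).
    return min(centroids,
               key=lambda c: sum((a - b) ** 2
                                 for a, b in zip(x, centroids[c])))

correct = sum(predict(x) == y for x, y in test)
print(f"test accuracy: {correct}/{len(test)}")
```

The key point is the separation of duties: the model only ever sees the training records, so its accuracy on the held-out test records estimates how it will behave on new, unseen equipment.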
Unsupervised learning is the process of analyzing data to find hidden patterns that indicate product associations and groupings—for example, customer segmentation. Grouping is based on maximizing similarity within groups while minimizing it between them. The K-means clustering algorithm is widely used for this approach.

Predictive and descriptive analytical models can be built using advanced analytics or data mining tools, data science interactive workbooks with procedural or declarative programming languages, analytics clouds and automated model development tools.
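The K-means algorithm mentioned above alternates between two steps: assign each record to its nearest cluster center, then move each center to the mean of its assigned records. The sketch below is a minimal pure-Python version on invented customer data; the initialization is simplified (production implementations use random or k-means++ seeding, e.g. scikit-learn's `KMeans`).

```python
# Toy customer records: (annual spend, store visits); values are illustrative.
points = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2),
          (8.0, 8.5), (8.3, 8.0), (7.8, 9.0)]

def kmeans(points, k, iters=10):
    # Simplified initialization for the sketch: the first k points.
    centers = [points[i] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [tuple(sum(vals) / len(vals) for vals in zip(*c))
                   if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans(points, k=2)
print("cluster sizes:", [len(c) for c in clusters])
```

No labels are supplied anywhere: the two customer segments (low-spend and high-spend) emerge purely from similarity between records, which is what makes this unsupervised.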
After a model is developed, the next phase is to evaluate the accuracy of predictions or groupings. For predictions, this evaluation means understanding how many predictions were correct and how many were incorrect. Various methods can accomplish this evaluation. Key measures in model evaluation are the number of true positives, false positives, true negatives and false negatives. The bottom line is that we need to make sure that the model is accurate; otherwise, it could generate lots of false positives that may result in incorrect decisions and actions.
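The four counts named above (true positives, false positives, true negatives, false negatives) are easy to compute once a model's predictions are compared against known outcomes. A minimal sketch, using invented labels and predictions:

```python
# Hypothetical actual outcomes and model predictions (1 = positive class,
# e.g. "equipment failed"); values are illustrative.
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

accuracy  = (tp + tn) / len(actual)
precision = tp / (tp + fp)   # of the flagged positives, how many were real?
recall    = tp / (tp + fn)   # of the real positives, how many were caught?
print(f"TP={tp} FP={fp} TN={tn} FN={fn} "
      f"accuracy={accuracy} precision={precision} recall={recall}")
```

Precision is the measure that directly tracks the false-positive risk described above: a model with low precision floods the business with alerts that turn out to be wrong.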
Once we are happy with the model we’ve developed, the final phase involves deploying models to run in many different environments. These environments include spreadsheets, analytics servers, applications, database management systems (DBMSs), analytical relational database management systems (RDBMSs), Apache Hadoop, Apache Spark and streaming analytics platforms.
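Deployment details vary widely across those environments, but a common first step is serializing the trained model so a separate scoring environment can load and run it. A minimal sketch using Python's standard `pickle` module (the file name and the model's parameters are hypothetical; interchange formats such as PMML serve the same role across heterogeneous platforms):

```python
import os
import pickle
import tempfile

# Hypothetical trained model artifact: parameters learned during training.
model = {"weights": [0.42, -1.3, 0.07], "intercept": 0.5, "version": "1.0"}

# Serialize the model to disk in the development environment.
path = os.path.join(tempfile.gettempdir(), "churn_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...later, a scoring environment loads the same artifact to make predictions.
with open(path, "rb") as f:
    loaded = pickle.load(f)
print("round-trip intact:", loaded == model)
```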