Machine Learning: Datasets for Regression with Python
Synopsis
In “What Is Machine Learning” (https://sci-en-tech.com/ebooks/), we introduced machine learning (ML) at a high level and summarized its core concepts. A central takeaway was that modern ML models are fundamentally data-driven. Consequently, building a reliable model depends critically on a well-prepared dataset; thorough understanding, careful examination, and appropriate treatment of data are essential for both effective training and rigorous testing. Building on this, in “Machine Learning: Datasets for Classification with Python” (https://sci-en-tech.com/ebooks/), we discussed the creation, examination, and preprocessing of data specifically for classification tasks.
This booklet focuses on several major publicly available datasets used for training supervised regression ML models:
· California Housing Dataset
· Diabetes Dataset
· Airfoil Self-Noise Dataset
· Concrete Compressive Strength Dataset
· Energy Efficiency Dataset
· Bike Sharing Dataset
· Give Me Some Credit Dataset
· Superconductivity Dataset
Using these datasets, we provide practical demonstrations of how to load each dataset in Python (using scikit-learn or pandas), inspect its structure, and visualize representative samples. These demonstrations illustrate the fundamental principles of data management within a typical machine learning workflow. In particular, we address the following concepts, techniques, rules, and procedures:
· Data Normalization
· Special Mapping and Transformations
· Feature Extraction from Temporal Data
· Encoding Cyclical Features
· One-Hot Encoding
· Feature Importance Examination
· Feature Reduction via PCA
· Split-Fit-Transform Rule
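As a brief illustration of the loading, inspection, normalization, and split-fit-transform steps named above, the following is a minimal sketch rather than the booklet's own code. It assumes scikit-learn and pandas are installed, uses the Diabetes dataset bundled with scikit-learn, and picks an arbitrary 80/20 split with a fixed random seed. The key point is the order of operations: split first, fit the scaler on the training portion only, then apply the same fitted transform to both portions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the Diabetes dataset bundled with scikit-learn as pandas objects.
data = load_diabetes(as_frame=True)
X, y = data.data, data.target

# Inspect the structure: dimensions, column names, and the first few rows.
print(X.shape)
print(list(X.columns))
print(X.head())

# 1) Split: hold out a test set BEFORE computing any statistics.
#    test_size and random_state are illustrative choices, not prescribed values.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2) Fit: learn the mean and standard deviation from the training data only.
#    Note: the bundled features are already centered and scaled, so the
#    standardization here only demonstrates the procedure.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3) Transform: reuse the training statistics on the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

Fitting the scaler on the full dataset before splitting would let test-set statistics leak into training, which is what the split-fit-transform order is meant to prevent.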
These concepts are essential for improving the computational stability of machine learning models and for ensuring effective generalization to unseen data. High-quality data are the cornerstone of high-quality machine learning: if the underlying data are flawed, the resulting model will inevitably reflect those flaws, regardless of the training algorithm employed. This booklet presents the major techniques for preparing datasets for training supervised regression models.
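Two of the encodings listed above can be sketched in a few lines. The example below is a hypothetical toy illustration, not taken from any of the datasets in this booklet: the column names `hour` and `season` and their values are assumptions made for demonstration. It maps a cyclical feature onto the unit circle with sine and cosine, and expands a categorical feature into one-hot indicator columns using pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'hour' is cyclical (0-23), 'season' is categorical.
df = pd.DataFrame({
    "hour": [0, 6, 12, 18, 23],
    "season": ["winter", "spring", "summer", "fall", "winter"],
})

# Cyclical encoding: map the hour to an angle on the unit circle so that
# hour 23 and hour 0 end up close together, unlike in the raw integer scale.
angle = 2 * np.pi * df["hour"] / 24
df["hour_sin"] = np.sin(angle)
df["hour_cos"] = np.cos(angle)

# One-hot encoding: replace 'season' with one indicator column per category.
encoded = pd.get_dummies(df, columns=["season"], prefix="season")

print(encoded)
```

The sine and cosine columns are used together because either one alone maps two different hours to the same value.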
Although these techniques are presented in the context of regression models, they are also applicable to classification models, with appropriate consideration of differences in the target variable.
Note: This booklet focuses exclusively on data preparation; the training of ML models will be addressed in subsequent volumes.