MACHINE LEARNING
In my first position at the CBS, my team worked on an ML model to classify Census records. Previously, this classification was done manually by CBS employees, taking years for each Census.
Our goal was to classify over 830,000 text records into profession and industry categories with 90%+ accuracy, despite having no context.

by Tamar Dobrin

Why Machine Learning?
  • Machine Learning has transformed how we handle large-scale classification problems like the Census project.
  • Traditional methods (manual classification) are slow, costly, and prone to errors.
  • ML enables automation—it helps scale decision-making, improve accuracy, and reduce human effort.
  • Many industries rely on ML for data-driven insights, automation, and efficiency.

To understand how ML achieves this, let's take a step back and see how it fits into AI.
What is AI?
Definition
Matching or exceeding the capabilities of a human (or an animal).
What could that involve? For instance, the ability to discover, to find out new information; the ability to infer, to draw conclusions from information that may not be explicitly stated; and the ability to reason, to figure things out.
Leveraging computers or machines to mimic the problem-solving or decision-making capabilities of the human mind.
Not All AI is ML
Not all AI is ML. For example, rule-based systems can use predefined logical rules to analyse medical data and provide diagnostic recommendations, without needing to learn from data patterns.
Typical chess-playing engines are considered AI but not ML, because they rely on handcrafted rules and search algorithms rather than learning from data.
Implementation
Classic AI is generally a program, a collection of rules.
AI in today's terms has gone through a deep paradigm shift: the move to ML.
Machine Learning
Machine Learning is a subset of artificial intelligence (AI) that enables computers to learn from data and make predictions without being explicitly programmed.
ML involves making predictions or decisions based on data. Basically, it's a very sophisticated form of statistical analysis: it looks for predictions based on the information we have. So the more data we feed into the system, the more accurate its predictions and decisions become.
ML is focused on the use of various self-learning algorithms that derive knowledge from data to predict outcomes.
Types of Machine Learning
1
Supervised Learning
Model learns from labeled data. Most industry systems use this approach.
Supervised learning has 2 subcategories:
Regression:
Predict a continuous numeric target variable from given input variables.
For example, predicting the price of a house given any number of its features and determining their relationship to the final price.
Classification:
Predict a discrete categorical variable ("label"/"class").
For example, assigning the label spam/not spam to an email based on its content, sender, and so on. We can also have more than 2 classes, for example Primary, Social, and Promotions, as Gmail does by default.
2
Unsupervised Learning
Model finds patterns in unlabeled data. Less expensive, since no labels are needed, but generally less reliable than supervised learning (see the sketch after this list).
3
Reinforcement Learning
Model learns by interacting with an environment and receiving rewards. Used for games and autonomous vehicles.
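To make the difference between the first two types concrete, here is a minimal sketch, assuming scikit-learn and synthetic data, of the same points handled with labels (supervised) and without labels (unsupervised):

# Minimal sketch: the same toy data handled by a supervised classifier
# (labels available) and an unsupervised clustering algorithm (no labels).
# Assumes scikit-learn is installed; the data here is synthetic.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic 2-D points grouped into two blobs
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Supervised: learn a mapping from features X to known labels y
clf = LogisticRegression().fit(X, y)
print("supervised predictions:", clf.predict(X[:5]))

# Unsupervised: group the same points without ever seeing y
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignments:", km.labels_[:5])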
Key ML Concepts
ML models have 2 primary stages:
1
Training (Optimization)
Begins with a mathematical model mapping inputs to outputs. The model's parameters are adjusted gradually until the desired mapping is achieved.
2
Inference
After training, the model runs on new inputs to make predictions.
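A minimal sketch of these two stages, assuming scikit-learn and a synthetic dataset (the model and data below are illustrative, not the Census model):

# Minimal sketch of training vs. inference, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic labeled data standing in for a real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_new, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=42)

# Training (optimization): parameters are adjusted to fit the labeled data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Inference: the trained model predicts labels for inputs it has never seen
predictions = model.predict(X_new)
print(predictions[:10])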
Underfitting & Overfitting
Underfitting
Occurs when the model is too simple (or insufficiently trained) to capture the underlying patterns in the data, so it performs poorly even on the training data.
Overfitting
Model is too complex and learns the noise in the training data, leading to excellent performance on training data but poor generalization on unseen data.
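A small sketch of both failure modes, assuming scikit-learn and synthetic data: a degree-1 polynomial underfits the curved data, while a degree-15 polynomial fits the training noise and scores worse on held-out points.

# Minimal sketch of under- vs. overfitting with polynomial regression.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, "train R^2:", round(model.score(X_train, y_train), 2),
          "test R^2:", round(model.score(X_test, y_test), 2))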
Performance Metrics
Overview
When evaluating a machine learning model, especially for classification tasks, we use different performance metrics to measure how well the model predicts outcomes.
Confusion Matrix
Confusion Matrix is a table that helps calculate performance metrics like Accuracy, Precision, Recall, and F1-score. It provides a detailed breakdown of a classification model’s performance by showing how many predictions fall into four categories:
-True Positive (TP) → Model correctly predicted the positive class.
-True Negative (TN) → Model correctly predicted the negative class.
-False Positive (FP) (Type I Error) → Model incorrectly predicted positive when it should be negative.
-False Negative (FN) (Type II Error) → Model incorrectly predicted negative when it should be positive.
Key Metrics
  • Accuracy: The percentage of correct predictions out of total predictions
When to Use?
-Works well when the dataset has a balanced number of classes.
-Not reliable for imbalanced datasets (e.g., if 95% of emails are "not spam," a model that always predicts "not spam" will have 95% accuracy but is useless for detecting spam).
  • Precision (Positive Predictive Value): Out of all predicted positive cases, how many were actually correct?
When to Use?
-When false positives are costly, e.g., in spam detection (you don’t want important emails marked as spam).
-In medical diagnosis (you don’t want to falsely diagnose a healthy person as having a disease).
  • Recall (Sensitivity or True Positive Rate): Out of all actual positive cases, how many did the model correctly identify?
When to Use?
-When false negatives are costly, e.g., disease detection (missing a cancer case is dangerous).
-In fraud detection, it's better to flag some false positives than to miss real fraud cases.
  • F1-Score (Harmonic Mean of Precision & Recall): Balance between precision and recall
When to Use?
-When you need a trade-off between Precision and Recall.
-Useful when you have an imbalanced dataset
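The following sketch, assuming scikit-learn and made-up labels, derives the confusion matrix counts and all four metrics from the same predictions; the formulas appear in the comments.

# Minimal sketch computing the four metrics from a confusion matrix.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes (1 = positive)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model's predictions (illustrative)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)

# The same counts drive all four metrics:
# Accuracy  = (TP + TN) / (TP + TN + FP + FN)
# Precision = TP / (TP + FP)
# Recall    = TP / (TP + FN)
# F1        = 2 * Precision * Recall / (Precision + Recall)
print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))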
ML Algorithms & Techniques
Machine Learning (ML) algorithms can be broadly categorized into different types based on their approach to learning from data. Some of the most commonly used algorithms include Linear Regression, Decision Trees, Random Forests, Support Vector Machines (SVMs), and Neural Networks.
Linear Regression
Linear Regression is a supervised learning algorithm used for predicting continuous values. It establishes a relationship between the input features (independent variables) and the output (dependent variable) using a straight line.
How it Works
  • The model finds the best-fitting line through the data points.
  • The equation for the line is: y=mx+b where:
  • y is the predicted value
  • x is the input feature
  • m is the slope (weight)
  • b is the intercept (bias)
Example Use Case
Predicting house prices based on square footage
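A minimal sketch of this use case, assuming scikit-learn; the square-footage and price figures below are invented for illustration.

# Fit a straight line (y = mx + b) to toy house-price data.
import numpy as np
from sklearn.linear_model import LinearRegression

sqft = np.array([[800], [1000], [1200], [1500], [1800]])          # input feature x
price = np.array([150_000, 180_000, 210_000, 260_000, 300_000])   # target y

model = LinearRegression().fit(sqft, price)
print("slope m:", model.coef_[0], "intercept b:", model.intercept_)
print("predicted price for 1,400 sqft:", model.predict([[1400]])[0])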
Decision Trees
Decision Trees are supervised learning algorithms used for both classification and regression tasks. They work by splitting data into branches based on feature values, forming a tree-like structure.
How it Works
  • The model starts at the root and splits the dataset at different feature values.
  • Each split creates branches until a final decision (leaf node) is reached.
Example Use Case
  • Customer segmentation (e.g., predicting if a customer will buy a product based on their browsing behavior).
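A minimal sketch, assuming scikit-learn; the browsing features (pages viewed, minutes on site) and purchase labels are made up for illustration.

# Train a small decision tree and print its splits.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[3, 2], [10, 15], [1, 1], [8, 12], [2, 3], [12, 20]]  # [pages_viewed, minutes_on_site]
y = [0, 1, 0, 1, 0, 1]                                     # 1 = bought the product

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["pages_viewed", "minutes_on_site"]))
print(tree.predict([[9, 14]]))  # prediction for a new visitor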
Random Forest
Random Forest is an ensemble learning algorithm that combines multiple Decision Trees to improve accuracy and reduce overfitting.
How it Works
  • The algorithm creates multiple Decision Trees using different subsets of data (bootstrapping).
  • It then takes the majority vote (for classification) or averages predictions (for regression).
Example Use Case
  • Fraud detection in banking.
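A minimal sketch, assuming scikit-learn and a synthetic, imbalanced dataset standing in for transaction features (not a real fraud model).

# 100 trees, each trained on a bootstrap sample; predictions are majority-voted.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))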
Support Vector Machines
SVMs are powerful classification algorithms that work well with small and complex datasets. They aim to find the best decision boundary (hyperplane) that separates different classes.
How it Works
  • The model finds a hyperplane that maximizes the margin between different classes.
  • Works well for both linear and non-linear classification.
Example Use Case
  • Handwritten digit recognition.
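A minimal sketch on scikit-learn's built-in 8x8 handwritten-digits dataset, assuming scikit-learn is installed.

# Fit an SVM with an RBF kernel to separate the ten digit classes.
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

# The RBF kernel handles non-linear boundaries between classes
svm = SVC(kernel="rbf", gamma=0.001, C=10).fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))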
Neural Networks & Deep Learning
Deep Learning
Deep Learning is a subset of ML that involves Neural Networks with multiple hidden layers, making it capable of learning complex patterns from large datasets.
How it Works
  • Uses multiple layers of neurons to extract higher-level features.
  • Requires large datasets and significant computational power.
Example Use Case
  • Autonomous vehicles (self-driving cars).
  • Natural language processing (e.g., ChatGPT).
Neural Networks
Neural Networks are inspired by the human brain and consist of layers of artificial neurons that process data and learn patterns.
How it Works
  • Consists of input, hidden, and output layers.
  • Each neuron in a layer receives weighted inputs, applies an activation function, and passes the output to the next layer.
Example Use Case
  • Image recognition (e.g., detecting objects in photos).
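A minimal sketch of a small neural network (a multi-layer perceptron) on the same built-in digits dataset, assuming scikit-learn; real-world image recognition would typically use a deep-learning framework and far more data.

# Two hidden layers of neurons; each applies weights and a ReLU activation.
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))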
Conclusion
Linear Regression
Simple and effective for continuous predictions.
Decision Trees & Random Forest
Interpretable but can overfit. Random Forest improves by using multiple models.
SVMs
Excellent for small, complex datasets.
Neural Networks
Powerful for complex tasks but require large datasets and computing resources.
Each technique has unique advantages and is selected based on the specific problem requirements.