A practitioner's guide to navigating careers in data science, machine learning, AI, analytics, and product management. This isn't another linear "learn Python, then XGBoost, then Docker" roadmap. This is about building real skills, understanding real problems, and finding your unique path.
Why Another Roadmap?
Good morning. So, I thought — why not compile a good list, a mind map, a sort of recipe for 2026? Something that covers the broad spectrum of how AI, machine learning, data science, and data engineering are going to look for the next year. What you can learn, build, and the problem spaces or job opportunities that can maximize your success.
Here's the thing: most roadmaps you get from WhatsApp forwards or LinkedIn (the ones your friends have liked, so they pop up in your feed) are very linear and very surface-level. They start with "learn Python," then "read something about Attention Is All You Need," then "RL," then "agents," then "FastAPI or Docker." It's the same template for everyone. It's not tailored to the specific personality or direction you want to build.
I want to take this opportunity to discuss in depth how companies actually see these roles, and how we can tailor something, a mind map, to different pockets or segments: practitioners, beginners, learners, people wanting to transition, or people wanting to converge on a specific field.
The Five Domains
Let's divide this clearly into five major areas:
- Data Science
- Machine Learning
- AI
- Analytics
- Product Management
If you get an idea of these five, you can mix and match with other skills — leadership, sales, growth, marketing — and figure out what combination works for you.
For example, right now you might be working as a data science manager or lead machine learning engineer at some company. Or maybe you're just getting started at an MBA program or a master's program. Maybe you're interested in sales, but now everything is about AI — the peak AI moments. Then someone might suggest becoming an AI Product Manager. Or if you're good at marketing but want to transition from typical SaaS marketing to AI products and software — maybe a DevRel kind of role.
This guide is about putting out my perspective on all these things. I don't want to cover it at a shallow level. I want to take this opportunity to detail what these roles really look like, what skill sets we need, what mindset we need, and the real workouts required to see ourselves there.
Part 1: Data Science
My Journey and the Evolution of the Field
When I started my journey back in 2019, I wasn't aware of different terminologies like data scientist versus machine learning engineer. At that time, there wasn't even an "AI engineering" term — it was all machine learning engineers or data scientists. Later, I started understanding the conventions that different companies had: "Data scientists build the model, machine learning engineers deploy the model."
Then came the sudden buzz around MLOps. A lot of advocacy emerged: "You shouldn't let your model sleep in your Jupyter notebooks — it has to go to production." For that, you need to learn MLOps. The year 2022 was kind of the peak MLOps moment. Even before ChatGPT launched, many people in industry were buzzing about MLOps.
I don't want to define roles or things to learn in terms of tools, because tools are just different orchestration points. The real canvas for solving a specific problem isn't the tools — it's the concepts, the theory, the way you look at problems.
My first job title was "Trainee Decision Scientist," and we ended up doing everything. We were consultants understanding problem statements. We were marketing people doing cold emails, pamphlets, pitch calls. We were doing applied research, pulling from new models on GitHub, and deploying too. In that sense, the boundaries were fluid.
Now, at the edge of 2025's end — as I record this, it's 10 AM in New York, just hours from the new year — let me share how things have evolved and where they're heading.
What Data Science Actually Is Today
The field has evolved a lot, and now everything is very clearly defined. Data science is going to be fully centered around business-focused problems or outcome-focused things.
For example:
- A company wants to forecast something
- A research laboratory might do genome analysis
- A newspaper journalist wants to track social media trends
It's about defining different experiments, analyzing historic data, seeing trends, and building predictive or prescriptive solutions.
If I could put it uniquely: Data science is about four pillars — descriptive, inquisitive, predictive, and prescriptive ways of solving problems or helping organizations, industries, or systems.
You have data. You have different views — aiming to predict something or prescribe something. How can I stop inventory overflows? How can I improve profit? How can I improve product usage?
Industrial Data Science vs. Academic Data Science
These days, industrial data science is different from data science happening at research institutes or laboratories. In academia, it's more interdisciplinary. For example, at Columbia, a lot of data science research happens at the intersection of:
- Art and machine learning
- Biology, systems biology, and machine learning
- Psychology and machine learning
- Statistics and politics
Data science is as simple as this: Computer Science + Statistics + Any Industry.
That's it. The industry could be operations research, finance, journalism, banking, anything. It's a very big tent kind of thing.
The Triangle: What You Fundamentally Need
What does someone fundamentally need to learn if they want to be centered on this triangle — if they still love statistics, love programming or automating things at scale, and love a specific business area or problem space?
That's where this triangle forms. And where does machine learning play a role? Obviously, it sits within statistics and computer science.
Consider the jobs at Walmart, or in India at Jio, Hotstar, Swiggy, Zomato, or even startups. They want someone with:
- Very good industrial or business understanding
- Ability to code and build things
- Good understanding of data and statistics
- Ability to build experiments or measure something
The reason I'm centering it on this triangle will become clearer when I get to machine learning engineer roles and how they differ: different situations require different emphasis.
The Data Science Institute Model
I love how most data science institutes are built across the US. For example, Columbia's Data Science Institute is built upon statistics, computer science, and the IEOR department. Their research activities span finance, journalism, art, creativity, nursing, healthcare, childcare, plus fundamental research.
It's not only Columbia. You see the same at University of Chicago, Brown, UC San Diego's Data Science Institute. That's how highly interdisciplinary this field is — solving important problems in social systems, business systems, or industrial systems, collaborating with big data, statistics, machine learning, and deep learning methodologies.
Paths Within Data Science
Path 1: The Generalist Data Scientist
If someone wants to get into data science, there are multiple paths. One is the generalist way — you get data, do some data analysis, try to understand or find patterns, and build a model. You might not be very critical or have different perspectives on the problem. You have a very linear approach: "There's data, I'll do analysis, and if they want forecasting, computer vision, time series, or customer segmentation, I know what to do."
That's how most data science careers start. Many still confine themselves to a very linear way of seeing things. They have specific algorithms in muscle memory: "Let's try XGBoost. Let's try some typical regression. Let's use whatever current tools are available straight out of the box."
This type of data scientist may not explore the learning theory behind machine learning or probabilistic modeling in depth. I've rarely seen data scientists who have elaborately worked on the probabilistic sides of data science.
The typical path looks like this:
- Regression
- XGBoost, CatBoost
- Then quickly jump into a few weeks or months of learning multi-layer perceptrons
- Convolutional neural networks, RNNs
- Then "transformers are all you need"
- Build NLP solutions with BERT or fine-tuned GPT
Because that's how it is — most data science positions are problem-first, then different paradigms come in.
The Generalist Approach: A Swiss Knife
The generalist view of data science means having better understanding about problems and a very good Swiss knife of statistics and machine learning tools. Understanding data, having basic knowledge of different assumptions or rule sets to select a methodology.
For a generalist data scientist, I'd say at least understand some very unique or common problems across different industries:
Banking: Risk and forecasting, a bit of optimization or hedging. Three major things — quantifying risk, forecasting (who would churn, who would default), and hedging for better returns.
This needs a lot of statistical understanding — not just central limit theorem, t-tests, and z-tests. There exist hundreds of hypothesis tests. If you really ground yourself from a statistical point of view, you can learn different parametric and non-parametric tests, different linear and non-linear models.
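As a tiny illustration of the parametric versus non-parametric choice, here's a sketch comparing a Welch t-test against a Mann-Whitney U test on simulated, skewed spend data (the data and group names are made up):

```python
# A minimal sketch: comparing two groups with a parametric t-test and a
# non-parametric alternative. The data is simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical monthly spend for two customer segments (skewed, as spend often is)
group_a = rng.lognormal(mean=3.0, sigma=0.6, size=200)
group_b = rng.lognormal(mean=3.1, sigma=0.6, size=180)

# Welch's t-test: assumes roughly normal sampling distributions of the means
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U: a non-parametric test that makes no normality assumption
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"Welch t-test p-value:   {t_p:.4f}")
print(f"Mann-Whitney U p-value: {u_p:.4f}")
```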
People don't need to follow the typical sequence: linear regression → logistic regression → decision tree → random forest → XGBoost, then "we wrapped up machine learning," then put neural networks outside of that "classic" machine learning thing.
I'd say it's all machine learning. Just different people from different time periods came up with their solutions. If you read about Vladimir Vapnik, he mainly compared three methods: decision trees, his support vector machines, and neural networks. And actually, neural networks came way before support vector machines or random forests.
Here's my take: If someone wants to learn machine learning, they can start with neural networks directly. There's no harm. There's no requirement to learn other things first.
But for generalist data scientists, starting from statistics matters because most problems are by default grounded on statistical data analysis rather than jumping straight to models.
My suggestion: as a generalist data science professional, it's good to start with statistics (statistical analysis, statistical learning). Then, as problems emerge, you can learn how to use neural networks or autoencoders for anomaly detection or customer segmentation. Maybe build one big neural network to do multiple tasks: customer lifetime prediction, churn risk, propensity modeling. You can build a whole network and train it, but it's a lot of effort, compute, and experimentation.
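If you do go down that route, a minimal multi-head sketch in PyTorch might look like the following; the feature count, layer sizes, and the fake batch are all placeholders, not a recommended architecture:

```python
# A minimal sketch of one shared network with multiple task heads
# (lifetime value regression, churn risk, purchase propensity).
# Dimensions and the dummy data are arbitrary placeholders.
import torch
import torch.nn as nn

class MultiTaskCustomerModel(nn.Module):
    def __init__(self, n_features: int = 32, hidden: int = 64):
        super().__init__()
        # Shared trunk learns a common customer representation
        self.trunk = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.ltv_head = nn.Linear(hidden, 1)         # regression: lifetime value
        self.churn_head = nn.Linear(hidden, 1)       # binary: churn risk (logit)
        self.propensity_head = nn.Linear(hidden, 1)  # binary: purchase propensity (logit)

    def forward(self, x):
        z = self.trunk(x)
        return self.ltv_head(z), self.churn_head(z), self.propensity_head(z)

model = MultiTaskCustomerModel()
x = torch.randn(8, 32)                       # a dummy batch of 8 customers
ltv, churn_logit, prop_logit = model(x)

# Combined loss: one network trained on all three objectives at once
loss = (
    nn.functional.mse_loss(ltv, torch.randn(8, 1))
    + nn.functional.binary_cross_entropy_with_logits(churn_logit, torch.randint(0, 2, (8, 1)).float())
    + nn.functional.binary_cross_entropy_with_logits(prop_logit, torch.randint(0, 2, (8, 1)).float())
)
loss.backward()
```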
Here's the reality: Most business problems are usual problems. Companies don't want to experiment too much. They already know their top three candidates: "Let's start with XGBoost. If that doesn't work, let's do CatBoost." That's how people actually see things.
If you're optimizing for what companies expect in hiring, then be good at:
- Writing good code and deploying models
- Understanding data and making sense of it with statistical models and machine learning algorithms
- Having interest in an industry or problem space
Growing From Generalist to Specialist
If you want to grow from generalist to specialist, you can focus on any one of these directions.
Statistical Specialization
Causal Inference
If you're doing statistical learning and analysis, you can progress into causal inference as a point of focus. Almost all companies are now moving into causal inference for understanding:
- Real impact and effect of promotions and campaigns
- What marketing to prioritize
- The reason behind customer churn
- The reason behind losses
- Understanding customer behavior
Sometimes people say "pattern recognition," but there are many flaws if we just see from typical eyes — typical plotting or surface-level statistical analysis. "I plot a t-test, there's statistical significance, so this group is different from that group."
But ideally, there exists selection bias. There's not accounting for different assumptions. And t-test itself is kind of a wrong method in most scenarios. People select t-test if samples are less than 30, z-test if more than 30. That itself is flawed in this era where we have higher computation and better data availability.
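With today's compute you can often sidestep those rules of thumb entirely with resampling. Here's a minimal permutation-test sketch on simulated data (the metric and sample sizes are invented):

```python
# A minimal permutation test: instead of relying on the "t-test under 30,
# z-test over 30" rule, shuffle the group labels and recompute the difference
# in means under the null. Data is simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
control = rng.exponential(scale=10.0, size=40)      # hypothetical skewed metric
treated = rng.exponential(scale=12.0, size=35)

observed_diff = treated.mean() - control.mean()
pooled = np.concatenate([control, treated])

n_permutations = 10_000
diffs = np.empty(n_permutations)
for i in range(n_permutations):
    perm = rng.permutation(pooled)                  # shuffle labels under the null
    diffs[i] = perm[:treated.size].mean() - perm[treated.size:].mean()

# Two-sided p-value: how often a random labeling looks as extreme as observed
p_value = np.mean(np.abs(diffs) >= abs(observed_diff))
print(f"observed difference = {observed_diff:.2f}, permutation p-value = {p_value:.4f}")
```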
Causal inference is real data science — the real science field you start getting into.
Social Science & Survey Data Path
If you're interested in social science, psychology, or consumer behavior, then focus more on statistical models, especially linear mixed-effects models. That's something you want to specialize in if you work with survey data, repeated measures, or otherwise grouped observations.
Say you see yourself working with pharmaceuticals — then statistical models, especially mixed-effects models, doing proper tests, understanding statistical interpretations, that matters a lot. Many of my friends work in pharmaceutical and biopharma companies as statistical programmers or data scientists.
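To give a flavor of what that looks like in code, here's a minimal mixed-effects sketch with statsmodels, assuming hypothetical repeated survey scores per subject (all column names and effect sizes are made up):

```python
# A minimal linear mixed-effects model: repeated survey responses per subject,
# with a fixed treatment effect and a random intercept for each subject.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_subjects, n_visits = 50, 4
df = pd.DataFrame({
    "subject_id": np.repeat(np.arange(n_subjects), n_visits),
    "treatment": np.tile([0, 1, 0, 1], n_subjects),
})
# Each subject gets their own baseline (the random intercept we want to model)
subject_effect = np.repeat(rng.normal(0, 2, n_subjects), n_visits)
df["score"] = 10 + 1.5 * df["treatment"] + subject_effect + rng.normal(0, 1, len(df))

# Fixed effect: treatment; random intercept: subject_id
model = smf.mixedlm("score ~ treatment", data=df, groups=df["subject_id"])
result = model.fit()
print(result.summary())
```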
Quant Research Path
Or maybe you want to focus more on the statistical point of view and become a quant researcher. Most quantitative trading, investment, or finance is built on top of very good statistical models and analysis. You can explore:
- Linear and non-linear models
- Splines
- Monte Carlo
- Hidden Markov chains
- Probabilistic models
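As one small example from that list, here's a Monte Carlo sketch that estimates one-day Value at Risk from simulated returns; the portfolio size and return parameters are invented for illustration:

```python
# A minimal Monte Carlo sketch: simulate one-day portfolio returns and read
# off Value at Risk as a quantile of the simulated P&L distribution.
import numpy as np

rng = np.random.default_rng(7)
portfolio_value = 1_000_000          # hypothetical portfolio, in dollars
mu, sigma = 0.0004, 0.012            # assumed daily mean return and volatility

n_sims = 100_000
simulated_returns = rng.normal(mu, sigma, n_sims)
simulated_pnl = portfolio_value * simulated_returns

# 99% one-day VaR: the loss exceeded in only 1% of simulated scenarios
var_99 = -np.percentile(simulated_pnl, 1)
print(f"99% one-day VaR is roughly ${var_99:,.0f}")
```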
It shouldn't be just "I see everything as linear regression and plug things in." There are two perspectives here:
If you stick with the usual methods (linear regression, logistic regression, generalized linear models), most of them are satisfying enough for social science experiments or publication-level work. You're not chasing stronger associations or higher R-squared values; you just want to see trends.
But imagine focusing on quant research, quantitative trading, or investments, where you want the model to create more value. In social science, the model isn't creating monetary value; it just gives direction for the next round of hypothesis testing.
In business-focused problems, you always want a better model that gives more confidence, knowing what you're capturing and what you're missing. In that case, grounding yourself in statistical analysis matters.
Probabilistic Modeling and Conformal Prediction
Learn probabilistic modeling so you can quantify uncertainty in your risk or predictions. There's a field called conformal prediction, fully focusing on uncertainty quantification.
If you're pivoting into insurance, investments, operational risk — where decisions are highly tied to model quality — focus on statistical things, then pivot into conformal prediction and probabilistic modeling.
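For a sense of what that looks like, here's a minimal split conformal prediction sketch using scikit-learn, with synthetic data and a 90% coverage target chosen purely for illustration:

```python
# Split conformal prediction: fit a model on a training split, compute
# absolute residuals on a calibration split, and use their quantile to turn
# point predictions into intervals with (roughly) 90% coverage.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 5))
y = X[:, 0] * 3 + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=2000)

X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingRegressor().fit(X_train, y_train)

# Calibration: nonconformity scores are absolute residuals on held-out data
residuals = np.abs(y_cal - model.predict(X_cal))
alpha = 0.1
q = np.quantile(residuals, 1 - alpha)   # (a small finite-sample correction is often added)

X_new = rng.normal(size=(5, 5))
point = model.predict(X_new)
lower, upper = point - q, point + q      # intervals instead of bare point estimates
print(np.c_[lower, point, upper])
```

The interval width comes entirely from the calibration residuals, which is why this wraps around any underlying model.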
Where Does Machine Learning Fit?
You might ask where machine learning comes in. The thing is, machine learning is a method or tool as part of statistical analysis itself — or probabilistic modeling, conformal prediction, or causal inference.
In causal inference, there are meta-learners, S-learners, T-learners — different methodologies, but under the methodology, you're still using XGBoost or random forest or something else.
You don't need to rigidly delineate "I just want to do machine learning only, I don't want to touch statistics." That can't happen.
For example, if you're interested in causal inference, all the causal models or estimators people use these days — double machine learning or uplift modeling in marketing/promotions-focused causal inference — still need XGBoost or some machine learning model. It's an estimator. There's nothing saying you can only use linear estimators or can't use neural networks.
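To make the "machine learning model as estimator" point concrete, here's a minimal T-learner sketch that uses gradient boosting under the hood; the data is synthetic, and in practice libraries such as causalml or econml package these meta-learners for you:

```python
# A minimal T-learner: fit one outcome model on the treated group and one on
# the control group, then estimate individual uplift as the difference of
# their predictions. Any regressor can be the estimator; gradient boosting
# is used here. The data is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
n = 5000
X = rng.normal(size=(n, 4))
treatment = rng.integers(0, 2, size=n)                 # 1 = received the promotion
true_uplift = 2.0 * (X[:, 0] > 0)                      # effect exists only for one segment
outcome = X[:, 1] + treatment * true_uplift + rng.normal(scale=1.0, size=n)

# Separate models for treated and control observations
model_treated = GradientBoostingRegressor().fit(X[treatment == 1], outcome[treatment == 1])
model_control = GradientBoostingRegressor().fit(X[treatment == 0], outcome[treatment == 0])

# Estimated individual treatment effect (uplift) for every customer
uplift = model_treated.predict(X) - model_control.predict(X)
print("mean estimated uplift:", round(float(uplift.mean()), 2),
      "| true average effect:", round(float(true_uplift.mean()), 2))
```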
That's why I said earlier: to get started as a data scientist, you need three things — stats and machine learning, coding and scaling abilities, and understanding how problems work in reality.
Computational and Algorithmic Specialization
Say you identify yourself as: "I'm not that much focused on statistics, but I'm a very good coder. I see myself good at building software."
Then I'd suggest focusing on Python (I won't divert you with R — just stick with Python). Try to build scalable data science solutions.
Example: Imagine you have 10,000 products. For all 10,000 products, you want weekly forecasting. Just think about how someone can do it.
People who come into data science from the statistics or machine learning side don't really think about that much. In my early days, I never thought much about scale either. Obviously, scalable data science solutions need better thinking about systems, coding, and production.
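As a rough sketch of the pattern (not a production design), here's per-SKU forecasting run in parallel with pandas and joblib; the trailing-mean "model" is a deliberately trivial placeholder so the focus stays on the shape of the scaling problem:

```python
# Group weekly sales by SKU and fit a (deliberately trivial) forecaster per
# SKU in parallel. Swap the trailing-mean placeholder for a real model, and
# move to Spark or another distributed engine once this stops being enough.
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

rng = np.random.default_rng(11)
n_skus, n_weeks = 10_000, 104
sales = pd.DataFrame({
    "sku": np.repeat(np.arange(n_skus), n_weeks),
    "week": np.tile(np.arange(n_weeks), n_skus),
    "units": rng.poisson(lam=20, size=n_skus * n_weeks),
})

def forecast_one_sku(sku_id: int, history: pd.Series, horizon: int = 4) -> dict:
    # Placeholder model: forecast the next weeks as the mean of the last 8 weeks
    point = history.tail(8).mean()
    return {"sku": sku_id, "forecast": [point] * horizon}

grouped = sales.sort_values("week").groupby("sku")["units"]
forecasts = Parallel(n_jobs=-1)(
    delayed(forecast_one_sku)(sku_id, history) for sku_id, history in grouped
)
print(len(forecasts), "SKU forecasts produced")
```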
I don't call this MLOps. I'm throwing that word away for now. Think from this point of view:
The Problem: Say you want to do forecasting. A typical MLOps course never teaches you this. They always say "write a Docker container, deploy it, use observability tools to monitor." That's unwanted hype, I'd say.
The Real Pain Point: Say I have millions of customers, millions of households. Every week I want to understand their propensity toward buying something so I can schedule marketing campaigns. Or my colleague is good at suggesting models — vector autoregressions and such — but doesn't know how to scale it up. It's teamwork.
You want to position yourself: "Most data scientists know algorithms but don't know how to deal with scale of problems."
Thousands of customers, thousands of products, lakhs (hundreds of thousands) of combinations. I remember interviewing at Dunzo in 2022; they had thousands of SKUs to forecast. You have continuous data (how much sold each week, with some products seeing no sales for weeks). I hit similar problems working for a PetChem client, which had lots of product-type and location combinations. The forecasting or analysis happens at that SKU level.
We always imagine how to plot something or analyze at one product level or one snapshot of data. But when you get into industry — that's where real data science problem-solving comes in. It's not always how econometricians or social scientists or behavioral scientists deal with things. They have defined data from surveys or studies. They never worried about data itself or scale itself.
But when you switch focus to scale and complexity — that's where you need to think like an engineer.
Why Data Structures and Algorithms Matter
Sometimes courses never teach this perspective. "Data scientist means you build models. Then learn MLOps." But MLOps is poorly defined or focused.
The complexity isn't just deployment — building the solution itself is complex. MLOps is about production. What I'm saying is: even to develop something, even to test something, you need to handle complexity — thousands of customers, households, products, different locations.
That's where you need to learn:
- Better programming
- Distributed computing systems (Spark)
- Algorithms — the typical data structures and algorithms
Why algorithms? If you see data science from the perspective of business or statistics, you won't see the need for data structures and algorithms anywhere. Typical roadmaps say "learn Pandas, learn some Python tools, you're good."
But see it from the algorithms point of view: graph algorithms, bipartite matching, divide and conquer, dynamic programming. Problems are complex. We have computing systems, tools to automate and schedule (Airflow), cloud systems.
If you position yourself or specialize in computation — that's why some courses explicitly call it "computational data science." At face value, data science focuses too much on statistics and problem statements, but little on computation and complexity handling.
As I mentioned, social scientists, econometricians, behavioral scientists never had that problem. But when you get into industry with bigger problems, you need to develop it.
Coding is one part. Algorithmic thinking, data structures and algorithms — that helps a lot.
Examples:
- You want to design a survey or a matched study: you might need matching algorithms (see the sketch after this list)
- Time series forecasting — you might need heuristics to segment, divide, find better models
- You can't build models for each product, but also can't build one model for everything. How do you map models or techniques optimally?
- Customer segmentation — somewhere you need algorithmic thinking along with distributed computing
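As an illustration of the matching item above, here's a minimal sketch that treats pairing treated respondents with controls as a bipartite assignment problem and solves it with SciPy's Hungarian-algorithm implementation; the covariates are synthetic:

```python
# Match treated and control units (e.g., for a matched survey or
# quasi-experiment) as a bipartite assignment problem solved optimally
# with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
treated = rng.normal(size=(30, 3))      # covariates of treated respondents
controls = rng.normal(size=(80, 3))     # covariates of candidate controls

# Cost matrix: covariate distance between every treated/control pair
cost = cdist(treated, controls)

# Optimal one-to-one matching that minimizes total covariate distance
row_idx, col_idx = linear_sum_assignment(cost)
matches = list(zip(row_idx, col_idx))    # (treated index, matched control index)
print("first few matches:", matches[:5])
print("total matching cost:", round(float(cost[row_idx, col_idx].sum()), 2))
```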
I'd say focus on this, because most people haven't. Even I have explored that path very little. Looking back, I feel we should have thought more from the algorithmic side, using dynamic programming or greedy algorithms to reduce complexity.
Maybe come up with one or two prototypes or experiments. Kaggle, I'd say, is getting diluted from this perspective. It has large datasets, but people know how to game the leaderboards with feature engineering and similar tricks.
The Overhyping Problem
I always feel courses have overhyped common-sense things into something to focus on.
Feature engineering is common sense. There's no deep science or rigorously proven methodology behind it. Somebody says square a feature, take a standard deviation, apply a log transform, multiply or divide two columns. Then come feature stores, and it keeps going.
Most data science roadmaps and online courses lack something pragmatic. They aren't really teaching; they just cover the usual things we've all heard.
But if we're making ourselves specialists, training specific muscle groups, we're not doing everyday warm-up cardio anymore. We're focusing on particular muscles, a particular structure.
Don't worry too much about feature engineering and MLOps. Worry about these three things:
- Statistics
- Programming/Computation
- Business/Industry understanding
Business and Industry Specialization
Sometimes, more than being good in an algorithmic sense, computational sense, or statistical sense, you might need more business sense.
The reason: reality is where you learn that problems are complex. Sometimes we need very nuanced statistical methods, but some places demand data scientists who understand complex business problems.
Example: Investment Banking
That's altogether a complex business domain. It's not something you learn from one or two books or courses. You have to immerse yourself for at least six months to a year focusing on a very specific problem.
Value at Risk alone carries a lot of business nuance.
Example: Pharmaceutical/Biopharma
There's something called real-world evidence. Forget stats, forget computation. Think about the necessity of this analysis itself — the complexity of how data comes, how critical it is, the nuances, terminologies, and making sense of it.
Sometimes they prioritize Metric A over Metric B.
Example: Consumer/Retail
There exist two things: customer stickiness and customer lifetime value. Both give different aspects:
- Customer lifetime value is forward-looking; it may not change daily. You can't say it was $45 yesterday and $50 today.
- Customer stickiness is dynamic and evolving; yesterday's value is one thing, today's is another.
Same for banking or entertainment industries like Spotify or Netflix — average view counts, lots of defined metrics, lots of nuances about customer life journeys.
If in the future you want to build a product or start a company, you can also bet on a specific industry and then learn the other things necessary.
Example: "I'm sticking around with quant trading. Finance and quant."
Then focus deeply on all the nuances, methodologies, and problem spaces: how JPMC uses things, how Hudson River Trading works. Then tie it back: "For these companies, what could be better models?"
You may not even need in-depth neural networks knowledge. You can keep yourself grounded in probabilistic modeling and statistical modeling — GLM is fine, some unique time series methods and statistical testing methods.
You can choose your focus instead of scattering. "I need to learn MLOps, all machine learning methods including convolutional networks, but I'm preparing for Jane Street." That doesn't make sense.
Or maybe you're preparing for consumer finance, retail banking (say, a VP role at HSBC in India), insurance companies, or other banking sectors.
As I mentioned, many of my friends work in pharmaceutical and biopharma companies as statistical programmers or data scientists. Their leverage is industry understanding. Then come the right statistical tools, then the required programming or computational parts.
Define where you want to end up.
The Missing Piece: Optimization
Under data science, I've talked about:
- Getting started as generalist
- Tuning focus toward statistical/machine learning aspects
- Computational/algorithmic thinking aspects
- Business or industry focus
But I missed something: Optimization.
People who do optimization call themselves decision scientists. But optimization is something people can really focus on and specialize in — that's good leverage.
Consider: For larger petrochemical companies, larger inventories, supply chain companies, cargos, or production companies — what's more important for them?
Is it very nuanced statistics? These systems operate fairly systematically: disturbance, noise, bias, and fluctuations tend to be small. Demand might change, operational capacity might change, but that changes input and output values rather than the underlying behavior.
They want to do planning. So optimization makes a lot of sense:
- Linear programming
- Integer programming
- Heuristics-based methodologies
- Greedy search
That's why keeping algorithmic sense is important — optimization in terms of linear programming or integer programming, or heuristics. It makes a lot of sense in these industries.
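Here's a tiny production-planning sketch with SciPy's linprog; every coefficient is invented, and real planning problems would add integer variables and many more constraints:

```python
# A small production-planning LP: choose quantities of two products to
# maximize profit subject to machine-hour and raw-material limits.
import numpy as np
from scipy.optimize import linprog

profit = np.array([40.0, 30.0])          # profit per unit of product A and B
c = -profit                              # linprog minimizes, so negate to maximize

A_ub = np.array([
    [2.0, 1.0],                          # machine hours per unit
    [3.0, 4.0],                          # kg of raw material per unit
])
b_ub = np.array([100.0, 240.0])          # available hours and material

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print("optimal plan (units of A, B):", res.x.round(1), "| max profit:", round(-res.fun, 1))
```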
Summary: The Data Science Triangle
These three elements — the triangle itself — are enough to start as a generalist, then lean toward a specialized path:
- Someone who likes more statistical learning
- Someone who likes planning and optimization
- Someone who wants to do risk quantification
- Someone who wants to do scalable data science
- Someone who wants to ground themselves in specific industries
Resources I Actually Recommend
I don't want to just point you to some YouTube videos. If you're very fresh, want to make a good transition, or want to recharge yourself in 2026:
Stay a bit away from the very usual data science YouTube channels or courses. First, figure out your own roadmap after understanding these things. Figure out where you want to end up: the 20% of paths where you believe 80% of your growth will come from.
Books to Keep as Reference
For Statistics
"All of Statistics" by Larry Wasserman — Stick around with this one.
ISLR/ISLP — An Introduction to Statistical Learning (originally with applications in R; a Python edition, ISLP, is now available). Too good.
For Computational Thinking
CLRS (Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein) — A lot of problems discussed. The authors come from computer science and industrial engineering backgrounds. One of them, Cliff Stein, actually has his office inside the DSI at Columbia. He's fundamentally an industrial engineer, part of the IEOR department, and works a lot on optimization. Keep it as a reference, not something to finish cover to cover, but something to keep learning from. You'll get new ideas.
Once you find yourself interested in something — finance plus data science, healthcare plus data science — explore the best lecture series from universities like Duke, Stanford, Brown, UC San Diego, University of Chicago. You can find the curricula or lecture notes. That gives you a very good surface to start learning.
Learning Approach
One piece of very good advice: first get started with lecture notes, books, and raw materials. Then, later, move on to some of the best independent YouTubers:
- Andrej Karpathy
- Aleksander Molak (causal inference; lots of good conversations)
- Sebastian Raschka
They have very good standards.
Don't get into another bootcamp. There are two ways of learning:
- Learning like a scholar/doer — From professors whose lecture notes or videos are out, or people who teach you thought process, ideas, fundamental concepts — not giving you a cheat sheet.
- Learning like a bootcamp personality — Always optimizing toward cracking an interview. Bam, that's it.
I suggest learning like a scholar.
Books for Mindset and Soft Skills
"Noise" by Daniel Kahneman
About different biases in human judgment — helps a lot.
"Fooled by Randomness" by Nassim Taleb
Gives perspective about the quantitative side — playing with probability, statistical models versus money-focused problems like investments, trading, finance, revenue. It shows how things work in reality, how people make decisions.
"Art of Uncertainty" by David Spiegelhalter
(From the author of Art of Statistics) — Good for any data science professional.
These books can broaden your learning.
Coming Up Next
In the next parts, I'll talk about:
- Machine Learning Engineering
- AI
- Product Management
- Analytics
I thought I'd cover everything in one piece, but this conversation has been very elaborate for data science. I don't want to mix or bring everything together — sometimes people might be interested specifically in product. They'd have to read everything and search for what they need.
It's better to write in different parts. This has been about one hour of compiling all these things.
This is Part 1 of the 2026 Roadmap series. Stay tuned for Part 2 covering Machine Learning Engineering roles.