You probably solved Bayes’ Theorem in college and decided you’re “good at statistics.” But interviews reveal something else: most candidates don’t fail because they can’t code. They fail because they can’t think probabilistically.
Writing Python is easy. Reasoning under uncertainty isn’t.
In real-world data science, weak statistical intuition is expensive. Misread an A/B test, misjudge variance, or ignore bias, and the business pays for it. What separates strong candidates from average ones isn’t formula recall. It’s clarity around distributions, assumptions, and trade-offs. In this article, I walk through 15 probability and statistics questions that actually show up in interviews, and more importantly, how to think through them.
Core Probability Foundations
These questions evaluate whether you can reason about conditional probability, event independence, and data-generating processes rather than just memorising formulas. In short, they test whether you truly understand uncertainty and distributions.
Q1. What is Bayesian Inference and the Monty Hall Paradox?
One of the most persistent evaluations of probabilistic intuition involves the Monty Hall problem. A contestant is presented with three doors: behind one is a car, and behind the other two are goats. After selecting a door, the host—who knows the contents—opens another door to reveal a goat. He then offers the contestant the chance to switch. The interviewer seeks to determine if the candidate can move beyond the “50/50” fallacy and apply Bayesian updating to realize that switching provides a 2/3 probability of winning.
Interviewers ask this question to assess whether the candidate can handle conditional probability and understand information gain. It reveals whether an individual can update their “priors” when presented with new, non-random evidence. The host does not act randomly. The contestant’s initial choice and the actual location of the car constrain the host’s decision. The ideal answer utilizes Bayes’ Theorem to formalize the updating process:
P(H | E) = P(E | H) · P(H) / P(E)
In this framework, the initial probability P(Car) is 1/3 for any door. When Monty opens a door, he provides evidence E. If the car is behind the door the contestant initially chose, Monty has two goats to choose from. If the car is behind one of the other doors, Monty is forced to open the only remaining door with a goat. This asymmetry in the likelihood function P(E|H) is what shifts the posterior probability to 2/3 for the remaining door.
| Door Status | Probability (Initial) | Probability (After Monty Opens a Door) |
|---|---|---|
| Initial Choice | 1/3 | 1/3 |
| Opened Door | 1/3 | 0 |
| Remaining Door | 1/3 | 2/3 |
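To make the asymmetry concrete, here is a minimal Monte Carlo sketch (not part of the standard interview answer) that simulates both strategies and should land near 1/3 and 2/3:

import random

def monty_hall(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        choice = random.randrange(3)
        # The host opens a goat door that is neither the car nor the contestant's pick
        opened = random.choice([d for d in range(3) if d != choice and d != car])
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == car)
    return wins / trials

print("Stay:  ", monty_hall(switch=False))   # roughly 0.33
print("Switch:", monty_hall(switch=True))    # roughly 0.67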
Q2. The Poisson vs. Binomial Distribution Dilemma
In product analytics, a recurring challenge is determining the appropriate discrete distribution for modeling events. Interviewers often ask candidates to contrast the Poisson and Binomial distributions and explain when to use one over the other. They use this question to test whether the candidate truly understands the assumptions behind different data-generating processes.
The Binomial distribution models the number of successes in a fixed number of independent trials (n). Here each trial has a constant probability of success (p). Its probability mass function is defined as:
P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)
In contrast, the Poisson distribution models the number of events occurring in a fixed interval of time or space. It assumes events occur with a known constant mean rate (lambda) and independently of the time since the last event. Its probability mass function is:
P(X = k) = λ^k · e^(−λ) / k!
The nuanced answer highlights the Poisson Limit Theorem: the Binomial distribution converges to the Poisson as n becomes very large and p becomes very small, with the rate λ = np held fixed.
A practical example in data science would be modeling the number of users who convert on a website in a day versus modeling the number of server crashes per hour.
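To see the Poisson Limit Theorem numerically, here is a minimal sketch using scipy (the n and p values below are illustrative) that compares the two probability mass functions when n is large and p is small:

from scipy import stats

n, p = 10_000, 0.0003        # large n, small p (illustrative values)
lam = n * p                  # matching Poisson rate, lambda = np

for k in range(6):
    print(f"k={k}: binomial={stats.binom.pmf(k, n, p):.5f}, poisson={stats.poisson.pmf(k, lam):.5f}")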
You can check out our guide on probability distributions for data science here.
Q3. Explain The Law of Large Numbers and The Gambler’s Fallacy
This question is a conceptual trap. The Law of Large Numbers (LLN) states that as the number of trials increases, the sample average will converge to the expected value. The Gambler’s Fallacy, however, is the mistaken belief that if an event has occurred more frequently than normal, it is “due” to happen less frequently in the future to “balance” the average.
Interviewers use this to identify candidates who might erroneously introduce bias into predictive models. This can happen while assuming a customer is less likely to churn simply because they have been a subscriber for a long time. The mathematical distinction is independence. In a series of independent trials (like coin flips), the next outcome is entirely independent of the past. The LLN works not by “correcting” past results but by swamping them with a massive number of new, independent observations.
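A quick simulation helps separate the two ideas. This is a minimal sketch with a fair simulated coin: the running average converges to 0.5, while the flip immediately after a streak of heads remains 50/50:

import numpy as np

rng = np.random.default_rng(42)
flips = rng.integers(0, 2, size=100_000)        # fair coin: 0 = tails, 1 = heads

# Law of Large Numbers: the running average drifts toward 0.5
running_mean = np.cumsum(flips) / np.arange(1, flips.size + 1)
print(running_mean[[9, 99, 9_999, 99_999]])

# Gambler's Fallacy check: frequency of heads right after a streak of 5 heads
streaks = np.convolve(flips, np.ones(5), mode="valid")[:-1] == 5
print(flips[5:][streaks].mean())                # still about 0.5; the coin is not "due" for tails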
Statistical Inference & Hypothesis Testing
These are the backbone of data science interviews. This cluster tests whether you understand sampling distributions, uncertainty, and how evidence is quantified in real-world decisions.
Q4. What is the Central Limit Theorem and Statistical Robustness?
The Central Limit Theorem (CLT) is arguably the most important theorem in statistics. Interviewers ask for its definition and practical significance to verify that the candidate understands the justification for using parametric tests on non-normal data. The CLT states that the sampling distribution of the sample mean will approach a normal distribution as the sample size (n) increases, regardless of the population’s original distribution, provided the variance is finite.
The significance of the CLT lies in its ability to allow us to make inferences about population parameters using the standard normal distribution. For a population with mean ‘mu’ and standard deviation ‘sigma’, the distribution of the sample mean X-bar converges to:
X̄ ~ N(μ, σ² / n)
A senior candidate will explain that this convergence enables the calculation of p-values and confidence intervals for metrics like Average Revenue Per User (ARPU) even when individual revenue data is highly skewed (e.g., Pareto-distributed). To visualize this, Python’s numpy and seaborn libraries are often used to show how the distribution of means becomes increasingly bell-shaped as the sample size moves from n = 5 to n = 30 and beyond.
Code:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Skewed (exponential) population
pop = np.random.exponential(scale=2, size=100000)

def plot_clt(population, sample_size, n_samples=1000):
    # Draw n_samples samples of the given size and plot the distribution of their means
    means = [np.mean(np.random.choice(population, size=sample_size)) for _ in range(n_samples)]
    sns.histplot(means, kde=True)
    plt.title(f"Sample Size: {sample_size}")
    plt.show()

plot_clt(pop, 100)
Q5. P-Values and the Null Hypothesis Significance Testing (NHST) Framework
Defining a p-value is perhaps the most common interview question in data science, yet it is where many candidates fail by providing inaccurate definitions. A p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample data, assuming the null hypothesis (H_0) is true.
Interviewers ask this to determine if the candidate understands that a p-value is NOT the probability that the null hypothesis is true, nor is it the probability that the observed effect is due to chance. It is a measure of evidence against H_0. If the p-value is below a pre-determined significance level (alpha), typically 0.05, we reject the null hypothesis in favor of the alternative hypothesis (H_a).
A high-level response should discuss the “Multiple Comparisons Problem,” where performing many tests increases the likelihood of a Type I error (False Positive). The candidate should mention corrections such as the Bonferroni correction, which adjusts the alpha level by dividing it by the number of tests performed.
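As a minimal sketch of that correction, statsmodels exposes a multiple-testing helper; the p-values below are illustrative, not from a real experiment:

from statsmodels.stats.multitest import multipletests

p_values = [0.01, 0.04, 0.03, 0.20, 0.049]   # illustrative p-values from 5 separate tests

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print(reject)       # which hypotheses survive the corrected threshold
print(p_adjusted)   # each p-value multiplied by the number of tests (capped at 1)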
Q6. Type I vs. Type II Errors and the Trade-off of Power
Understanding the business consequences of statistical errors is vital. A Type I error (α) is a false positive: rejecting a true null hypothesis. A Type II error (β) is a false negative: failing to reject a false null hypothesis.
Interviewers ask this to gauge the candidate’s ability to balance risk. Statistical Power (1-beta) is the probability of correctly detecting a true effect. In industry, the choice between minimizing Type I or Type II errors depends on the cost of each. For example, in spam detection, teams often consider a Type I error (flagging an important email as spam) more costly than a Type II error (letting a spam email reach the inbox). As a result, they prioritize higher precision over recall.
A strong answer also connects statistical power to sample size. To increase power without increasing α, you must increase the sample size (n) or detect a larger effect size.
| Error Type | Statistic | Definition | Decision Consequence |
|---|---|---|---|
| Type I | α | False Positive | Implementing an ineffective change. |
| Type II | β | False Negative | Missing a revenue-generating opportunity. |
You can understand this difference in detail here.
Q7. What is the difference between Confidence Intervals and Prediction Intervals in Forecasting?
Many candidates confuse these two intervals. A Confidence Interval (CI) provides a range for the mean of a population parameter with a certain level of confidence (e.g., 95%). A Prediction Interval (PI) provides a range for an individual future observation.
Interviewers ask this to test the candidate’s understanding of uncertainty. The PI is always wider than the CI because it must account for both the uncertainty in estimating the mean (sampling error) and the natural variance of individual data points (irreducible noise). In business, teams use a confidence interval (CI) to estimate the average growth of a metric, while they use a prediction interval (PI) to forecast what a specific customer might spend in the future.
The formula for a prediction interval includes an extra variance term σ^2 to account for individual-level uncertainty:
PI = ŷ ± t_{α/2, n−2} · √( SE(ŷ)² + σ² )
This demonstrates a rigorous understanding of the components of variance in regression models.
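As a minimal sketch on synthetic data (the numbers are illustrative), statsmodels returns both intervals from a fitted OLS model, and the observation-level interval comes out wider:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3 + 2 * x + rng.normal(0, 2, 200)        # synthetic linear data with noise

model = sm.OLS(y, sm.add_constant(x)).fit()

# Intervals for a new point at x = 5
new_X = sm.add_constant(np.array([5.0]), has_constant="add")
frame = model.get_prediction(new_X).summary_frame(alpha=0.05)
print(frame[["mean_ci_lower", "mean_ci_upper"]])   # confidence interval for the mean response
print(frame[["obs_ci_lower", "obs_ci_upper"]])     # wider prediction interval for one observation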
Experimental Design & A/B Testing
This is where statistics meets product analytics. These questions check whether you can design robust experiments and choose the correct testing framework under real constraints.
Q8. Sample Size Determination for A/B Testing
Calculating the required sample size for an experiment is a core data science task. The interviewer wants to know if the candidate understands the relationship between the Minimum Detectable Effect (MDE), significance level (alpha), and power (1-beta).
The MDE is the smallest change in a metric that is business-relevant. A smaller MDE requires a larger sample size to distinguish the “signal” from the “noise”. The formula for sample size (n) in a two-sample test of proportions (standard for conversion rate A/B tests) is derived from the requirement that the distributions of the null and alternative hypotheses overlap by no more than alpha and beta:
n ≈ (z_{α/2} + z_{β})² · 2 · p(1 − p) / MDE²
Where p is the baseline conversion rate. Candidates should demonstrate proficiency with Python’s statsmodels for these calculations:
import numpy as np
from statsmodels.stats.power import NormalIndPower
import statsmodels.stats.proportion as proportion

# Effect size for lifting conversion from 10% to 12%
h = proportion.proportion_effectsize(0.10, 0.12)

analysis = NormalIndPower()
n = analysis.solve_power(effect_size=h, alpha=0.05, power=0.8, ratio=1.0)
print(f"Sample size needed per variation: {int(np.ceil(n))}")
This shows the interviewer that the candidate can translate theory into the engineering tools used daily.
Q9. Stratified Sampling and Variance Reduction
Interviewers often ask for the difference between simple random sampling (SRS) and stratified sampling to evaluate the candidate’s proficiency in experimental design. SRS ensures every member of the population has an equal chance of selection, but it can suffer from high variance if the population is heterogeneous.
Stratified sampling involves dividing the population into non-overlapping subgroups (strata) based on a specific characteristic (e.g., age, income level) and then sampling randomly from each stratum. Interviewers ask about this method to see whether the candidate knows how to ensure representation and reduce the standard error of the estimate. By ensuring that each subgroup is adequately represented, stratified sampling “blocks” the variance associated with the stratifying variable, leading to more precise estimates than SRS for the same sample size.
| Sampling Method | Primary Advantage | Typical Use Case |
|---|---|---|
| Simple Random | Simplicity; lack of bias. | Homogeneous populations. |
| Systematic | Efficiency; spread across intervals. | Quality control on assembly lines. |
| Stratified | Precision; subgroup representation. | Opinion polls in diverse demographics. |
| Cluster | Cost-effectiveness for dispersed groups. | Large-scale geographic studies. |
The “ideal” answer notes that stratified sampling is particularly critical when dealing with imbalanced datasets, where SRS might miss a small but statistically significant subgroup entirely.
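A minimal pandas sketch of proportional stratified sampling (the "segment" column and its proportions are illustrative) shows every stratum staying represented in the sample:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "segment": rng.choice(["free", "pro", "enterprise"], size=10_000, p=[0.85, 0.12, 0.03]),
    "spend": rng.exponential(50, size=10_000),
})

# Draw 10% from every segment so rare strata are not missed
stratified = df.groupby("segment", group_keys=False).sample(frac=0.10, random_state=1)
print(stratified["segment"].value_counts(normalize=True))   # mirrors the population proportions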
Check out all the types of sampling and sampling techniques here.
Q10. What is the difference between Parametric and Non-Parametric Testing?
This question assesses the candidate’s ability to choose the correct statistical tool when the assumptions of normality are violated. Parametric tests (like t-tests, ANOVA) assume the data follow a specific distribution and are generally more powerful. Non-parametric tests (like Mann-Whitney U, Wilcoxon Signed-Rank) make no such assumptions and are used for small samples or highly non-normal data.
A sophisticated answer discusses the trade-offs: while non-parametric tests are more “robust” to outliers, they have less statistical power, meaning they are less likely to detect a real effect if it exists. The candidate might also mention “Bootstrapping,” a resampling technique used to estimate the sampling distribution of any statistic without relying on parametric assumptions.
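A minimal scipy sketch on skewed synthetic data illustrates both ideas: a Mann-Whitney U test and a bootstrap interval for the difference in medians:

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.lognormal(mean=3.0, sigma=1.0, size=200)   # skewed, revenue-like data
group_b = rng.lognormal(mean=3.1, sigma=1.0, size=200)

# Non-parametric comparison with no normality assumption
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(p_value)

# Bootstrap confidence interval for the difference in medians
res = stats.bootstrap((group_a, group_b),
                      lambda a, b: np.median(a) - np.median(b),
                      n_resamples=5_000, method="percentile")
print(res.confidence_interval)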
Check out our complete guide on parametric and non-parametric testing here.
Statistical Learning & Model Generalization
Now we move from inference to machine learning fundamentals. Interviewers use these to test whether you understand model complexity, overfitting, and feature selection — not just how to use sklearn.
Q11. Explain the Bias-Variance Trade-off and Model Complexity
In the context of statistical learning, interviewers ask about the bias-variance trade-off to see how the candidate manages model error. Total error can be decomposed into:
Error = Bias² + Variance + Irreducible Error
High bias (underfitting) occurs when a model is too simple and misses the underlying pattern in the data. High variance (overfitting) occurs when a model is too complex and learns the noise in the training data, leading to poor generalization on unseen data.
The interviewer is looking for techniques to manage this trade-off, such as cross-validation to detect overfitting or regularization to penalize complexity. A data scientist must find the “sweet spot” that minimizes both bias and variance, often by increasing model complexity until validation error starts rising while training error continues to fall.
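A minimal scikit-learn sketch on synthetic data (the polynomial degrees below are illustrative) makes the pattern visible: training error keeps falling with complexity while cross-validated error eventually rises:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 3, 80)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(0, 0.3, 80)   # noisy non-linear signal

for degree in [1, 3, 10, 20]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    train_mse = np.mean((model.fit(X, y).predict(X) - y) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  CV MSE={cv_mse:.3f}")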
Q12. What is the difference between L1 (Lasso) and L2 (Ridge) Regularization?
Regularization is a statistical technique used to prevent overfitting by adding a penalty term to the loss function. Interviewers ask for the difference between L1 and L2 to test the candidate’s knowledge of feature selection and multicollinearity.
L1 regularization (Lasso) adds the absolute value of the coefficients as a penalty: λ Σ |wᵢ|. This can force some coefficients to exactly zero, making it useful for feature selection. L2 regularization (Ridge) adds the square of the coefficients: λ Σ wᵢ². It shrinks coefficients towards zero but rarely to zero, making it effective at handling multicollinearity where features are highly correlated.
| Regularization | Penalty Term | Effect on Coefficients | Primary Use Case |
|---|---|---|---|
| L1 (Lasso) | λ Σ \|wᵢ\| | Sparsity (zeros). | Feature selection. |
| L2 (Ridge) | λ Σ wᵢ² | Uniform shrinkage. | Multicollinearity. |
| Elastic Net | Both (L1 + L2) | Hybrid. | Correlated features + selection. |
Using L2 is generally preferred when you suspect most features contribute to the outcome, whereas L1 is better when you believe only a few features are truly relevant.
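A minimal scikit-learn sketch on synthetic data, where only 5 of 20 features matter by construction, shows the difference directly:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Only 5 of the 20 features are informative by construction
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))   # many exact zeros (sparsity)
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))   # typically none, only shrinkage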
Learn more about regularisation in machine learning here.
Q13. What is Simpson’s Paradox and the Dangers of Aggregation?
Simpson’s Paradox occurs when a trend appears in multiple subgroups but disappears or reverses when the groups are combined. This question is a favorite for evaluating a candidate’s ability to spot confounding variables.
A classic example involves kidney stone treatments. Treatment A might have a higher success rate than Treatment B for both small stones and large stones when viewed separately. However, because Treatment A is disproportionately given to “harder” cases (large stones), it may appear less effective overall in the aggregate data. The “lurking variable” here is the severity of the case.
The interviewer wants to hear that the candidate always “segments” data and checks for class imbalances before drawing conclusions from high-level averages. Causal graphs (Directed Acyclic Graphs or DAGs) are often mentioned by senior candidates to identify and “block” these confounding paths.
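A minimal pandas sketch using the commonly cited kidney-stone figures shows the reversal in a few lines:

import pandas as pd

# Commonly cited kidney-stone figures, used here for illustration
data = pd.DataFrame({
    "treatment":  ["A", "A", "B", "B"],
    "stone_size": ["small", "large", "small", "large"],
    "successes":  [81, 192, 234, 55],
    "patients":   [87, 263, 270, 80],
})

# Within each subgroup, Treatment A has the higher success rate...
by_group = data.assign(rate=data["successes"] / data["patients"])
print(by_group[["treatment", "stone_size", "rate"]])

# ...but in the aggregate, Treatment B looks better
overall = data.groupby("treatment")[["successes", "patients"]].sum()
print(overall["successes"] / overall["patients"])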
Q14. What is Berkson’s Paradox and Selection Bias?
Berkson’s Paradox, also known as collider bias, occurs when two independent variables appear negatively correlated because the sample is restricted to a specific subset. A famous example is the observation that in hospitals, patients with COVID-19 seem less likely to be smokers. This happens because “hospitalization” acts as a collider — severe COVID-19 or a smoking-related illness leads doctors to hospitalize the patient. If a patient does not have severe COVID-19, they are statistically more likely to be a smoker to justify their presence in the hospital.
Interviewers ask this to see if the candidate can identify “ascertainment bias” in study designs. If a data scientist only analyzes “celebrities” to find the relationship between talent and attractiveness, they will find a negative correlation because those who lack both are simply not celebrities. The solution is to ensure the sample is representative of the general population, not just a truncated subset.
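A minimal simulation of collider bias, with talent and attractiveness independent by construction, reproduces the spurious negative correlation once we condition on being a "celebrity":

import numpy as np

rng = np.random.default_rng(3)
talent = rng.normal(size=100_000)
attractiveness = rng.normal(size=100_000)    # independent of talent by construction

print(np.corrcoef(talent, attractiveness)[0, 1])   # about 0 in the full population

# Condition on the collider: "celebrity" requires a high combined score (illustrative cutoff)
celebrity = (talent + attractiveness) > 2.0
print(np.corrcoef(talent[celebrity], attractiveness[celebrity])[0, 1])   # clearly negative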
Q15. What is Imputation and the Theory of Missing Data?
Handling missing data is a daily task, and interviewers ask about it to evaluate the candidate’s understanding of “missingness” mechanisms. There are three primary types:
MCAR (Missing Completely at Random): The probability of data being missing is the same for all observations. Deleting these rows is safe and does not introduce bias.
MAR (Missing at Random): The probability of missingness is related to observed data (e.g., women are less likely to report their weight). We can use other variables to predict and impute the missing values.
MNAR (Missing Not at Random): The probability of missingness depends on the value of the missing data itself (e.g., people with low income are less likely to report it). This is the most dangerous form and requires sophisticated modeling or data collection changes.
The “ideal” answer critiques simple imputation methods (like filling with the mean) for reducing variance and distorting correlations. Instead, the candidate should advocate for methods like K-Nearest Neighbors (KNN) or Multiple Imputation by Chained Equations (MICE) which maintain the statistical distribution of the feature.
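A minimal scikit-learn sketch on synthetic data (the ~10% missingness is injected artificially) contrasts the simple and the distribution-preserving approaches:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.impute import KNNImputer, IterativeImputer

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 4))
X[:, 1] += 0.8 * X[:, 0]                     # correlated columns give the imputers signal
X_missing = np.where(rng.random(X.shape) < 0.1, np.nan, X)   # knock out ~10% of values

mean_imputed = np.where(np.isnan(X_missing), np.nanmean(X_missing, axis=0), X_missing)
knn_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)
mice_imputed = IterativeImputer(max_iter=10, random_state=0).fit_transform(X_missing)  # MICE-style

# Mean imputation shrinks the column variance the most
print(mean_imputed[:, 1].var(), knn_imputed[:, 1].var(), mice_imputed[:, 1].var())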
Conclusion
Mastering these 15 concepts won’t just help you clear interviews. It builds the statistical intuition you need to make sound decisions with real data. The gap between success and failure often comes down to understanding assumptions, variance, and how data is generated, not just running models.
As automated ML tools handle more of the coding, the real edge lies in thinking clearly. Spotting Simpson’s Paradox or correctly estimating Minimum Detectable Effect is what sets strong candidates apart.
If you’re preparing for interviews, strengthen these foundations with our free Data Science Interview Prep course and practice the concepts that actually get tested.
I am a Data Science Trainee at Analytics Vidhya, passionately working on the development of advanced AI solutions such as Generative AI applications, Large Language Models, and cutting-edge AI tools that push the boundaries of technology. My role also involves creating engaging educational content for Analytics Vidhya’s YouTube channels, developing comprehensive courses that cover the full spectrum of machine learning to generative AI, and authoring technical blogs that connect foundational concepts with the latest innovations in AI. Through this, I aim to contribute to building intelligent systems and share knowledge that inspires and empowers the AI community.