Intro to Statistics: Part 1: What is a Random Variable?
If you want to learn about statistics, the first thing you need to understand is the concept of a random variable. A random variable, as its name suggests, is a variable whose value occurs randomly. In other words, a random variable's value cannot be predicted ahead of time. Instead, the value is determined in other ways, for example by conducting an experiment and observing the outcome.
Here's Wikipedia's definition of a random variable:
In probability and statistics, a random variable, aleatory variable or stochastic variable is a variable whose value is subject to variations due to chance (i.e. randomness, in a mathematical sense).[1]:391 A random variable can take on a set of possible different values (similarly to other mathematical variables), each with an associated probability, in contrast to other mathematical variables.
Don't worry if that definition doesn't quite crystalize your understanding. Just let it simmer in the back of your mind for now. You probably have an intuitive understanding of randomness and probability and, therefore, what a random variable represents conceptually. That's good enough to get the ball rolling.
Random variables are typically denoted symbolically with an uppercase letter, e.g. X. Now, first off, don't confuse this X with the lowercase x that we're all familiar with from algebra, i.e. a variable whose value(s) satisfy some algebraic equation.
Do NOT think of X this way. It'll only confuse you. X is not an algebraic variable in that sense. X is a random variable, so we have to adjust the way we think about it.
Aside: typically, if you want to avoid confusion, you wouldn't tell people how NOT to think of something; you'd just tell them how to think about it. However, in this case I find it useful to make the distinction between random variables and algebraic variables, since we all come from algebra backgrounds and thinking of X like an algebraic variable might confuse you. For me it was the first hurdle to clear when adjusting my thinking toward random variables.
Thinking about random variables
Think of a random variable as representing the characteristics associated with the complete set of possible outcomes for a certain type of random experiment. For example, rolling a die is a random experiment. So is flipping a coin. So is measuring the height of a random person. While the outcome of a random experiment is unpredictable, we can make informed statements about the characteristics of the experiment, such as the set of possible outcomes and the probability of those outcomes occurring.
For example, if X is a random variable that represents the rolling of a single die, then the characteristics of X are:
- the set of possible outcomes is: 1, 2, 3, 4, 5, and 6
- each outcome has equal probability of occurring, p=1/6.
If we say X represents the flipping of a coin, then the characteristics of X are:
- the set of possible outcomes is: heads and tails (or simply 0 and 1 - numbers are easier to do math with)
- each outcome has equal probability of occurring, p=0.5.
If X represents the measured height of a random person, then we know:
- the set of possible outcomes is a continuous range of values, falling somewhere between the heights of the shortest and tallest persons alive
- the probabilities of the outcomes are not known ahead of time, but can be estimated through experimentation
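To make the die and coin descriptions above concrete, here is a minimal Python sketch. It stores each random variable as a mapping from outcome to probability and "runs the experiment" by drawing from that mapping. The names `die`, `coin`, and `observe` are made up for this illustration, and labeling tails as 0 and heads as 1 is an arbitrary choice.

```python
import random

# A random variable described by its characteristics:
# the set of possible outcomes and the probability of each one.
die = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}
coin = {0: 0.5, 1: 0.5}  # assumption: 0 = tails, 1 = heads

def observe(distribution, rng=random):
    """Run the random experiment once and return the observed outcome."""
    outcomes = list(distribution)
    weights = list(distribution.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]

rng = random.Random(42)  # fixed seed so repeated runs give the same draws
rolls = [observe(die, rng) for _ in range(10)]
flips = [observe(coin, rng) for _ in range(10)]
print("ten die rolls: ", rolls)
print("ten coin flips:", flips)
```

We can't say ahead of time which value any single call to `observe` will return, but every value it returns comes from the set of possible outcomes we wrote down.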
Outcomes and Probabilities
Each outcome in a random variable's complete set of possible outcomes has a probability of occurring. The probability for an individual outcome always falls between 0 and 1, where 0 means it never happens, 1 means it always happens, and anything in between means it happens that fraction of the time (p=0.15 means it happens 15% of the time). For example, flipping a coin has probability of heads = 0.5 (50%) and probability of tails = 0.5 (50%). Rolling a single die has 6 outcomes, each with equal probability of occurring: 1/6.
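One way to see what "happens that fraction of the time" means is to simulate many die rolls and count how often each face comes up. The sketch below is only an illustration: the number of rolls and the seed are arbitrary choices, and it assumes a fair die.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable
n_rolls = 60_000        # arbitrary; more rolls give steadier estimates

# Roll a fair die n_rolls times and count each face.
counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))

# Relative frequency = times the outcome occurred / number of rolls.
frequencies = {face: counts[face] / n_rolls for face in range(1, 7)}

for face, freq in frequencies.items():
    print(f"face {face}: observed {freq:.3f}, theoretical {1/6:.3f}")
```

The observed frequencies won't be exactly 1/6, but they should land close to it, and they tend to get closer as the number of rolls grows.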
Note that the sum of the probabilities of all possible outcomes equals 1. This is true of EVERY random variable. It's clearly illustrated by the coin flip and die roll examples:
Coin flip:
- possible outcomes: heads, tails
- P[heads] = 0.5
- P[tails] = 0.5
- P[all outcomes] = P[heads] + P[tails] = 0.5 + 0.5 = 1
Die roll:
- possible outcomes: 1, 2, 3, 4, 5, 6
- P[each outcome] = 1/6
- P[all outcomes] = P[each outcome] * number-of-outcomes = 1/6 * 6 = 1
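The two sums above can be checked in a few lines of Python. This sketch uses the standard library's `Fraction` type so that 1/6 is stored exactly; with ordinary floating-point numbers the total can be off by a tiny rounding error. The helper name `total_probability` is made up for this example.

```python
from fractions import Fraction

# Outcome -> probability, stored as exact fractions.
coin = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}
die = {face: Fraction(1, 6) for face in range(1, 7)}

def total_probability(distribution):
    """Add up the probabilities of all possible outcomes."""
    return sum(distribution.values())

print("coin:", total_probability(coin))  # prints "coin: 1"
print("die: ", total_probability(die))   # prints "die:  1"
```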
For the random variable representing the random measurement of heights, we don't know ahead of time the probability of each outcome (the measured height), so there isn't a simple mathematical proof to show that they sum to 1. Nevertheless it is a fact about this random variable and all others that the sum of the probabilities of all possible outcomes (all possible heights) is equal to 1.
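Here is a rough sketch of what estimating the height probabilities through experimentation could look like. Note the big assumption: since we have no real measurements here, the code fakes them by drawing from a normal distribution with a mean of 170 cm and a standard deviation of 10 cm. Those numbers are invented purely for illustration. The "measured" heights are grouped into 5 cm bins, and each bin's probability is estimated as the share of measurements that fell into it.

```python
import random
from collections import Counter

rng = random.Random(1)  # fixed seed so the run is repeatable

# ASSUMPTION: simulated heights in cm (normal, mean 170, sd 10).
# A real experiment would use actual measurements instead.
heights = [rng.gauss(170, 10) for _ in range(10_000)]

# Group heights into 5 cm bins, labeled by the bin's lower edge.
bin_width = 5
bins = Counter(int(h // bin_width) * bin_width for h in heights)

# Estimated probability of a bin = share of measurements in that bin.
estimated = {lower: count / len(heights) for lower, count in sorted(bins.items())}

for lower, p in estimated.items():
    print(f"{lower}-{lower + bin_width} cm: {p:.3f}")

total = sum(estimated.values())
print("sum of estimated probabilities:", round(total, 10))
```

Every measurement lands in exactly one bin, so the estimated probabilities add up to 1 (up to floating-point rounding), no matter what the individual values turn out to be.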
Intuitively this should make sense. Think of it this way: In any random experiment, the probability of SOMETHING happening -- of SOME outcome occurring -- is always 1 (something always happens, whatever that "something" is). That "something" consists of the complete set of possible outcomes -- no more and no less. So the sum of the probabilities of all possible outcomes must equal the probability of "something" occurring, which is always 1.
Recap
So to quickly recap what we've covered so far...
- A random variable is described by the characteristics of its complete set of possible outcomes
- Each outcome has an associated probability of occurring
- The sum of the probabilities of all possible outcomes is equal to 1