Handbook 1

–

Topic C

Random Sampling: practical considerations


A quick note: the examples and numbers in this topic assume a perfect world free of bias or constraints, which is impossible in the real world. However, use this topic as a way to understand & communicate the power of random sampling. Once you can understand how and why random sampling is a powerful sampling approach, you can work backwards to implement it appropriately in your research.


Mathematical Randomness

To talk about random sampling, let’s start with what "random" means in statistics. Let's take a simple example: children’s blocks.

Imagine there are four differently shaped blocks in a box: a circle, a triangle, a square, and a diamond. There's exactly one block of each shape. If you were blindfolded and grabbed one block, what are the chances of getting a circle block? A square block?

If it's one block per shape, that means you have a 1-in-4 or 25% chance of selecting any type of block. If every shape has the same chance of being selected, then the shape you pick up is random.
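One way to see this for yourself is to simulate the blindfolded grab many times. The sketch below is only an illustration: the shapes come from the example above, while the number of grabs and the seed are arbitrary assumptions.

```python
import random
from collections import Counter

# The four blocks from the example: exactly one block per shape.
blocks = ["circle", "triangle", "square", "diamond"]

# Seeded generator so the run is repeatable (the seed value is arbitrary).
rng = random.Random(42)

# Simulate 10,000 blindfolded grabs, putting the block back each time.
draws = 10_000
counts = Counter(rng.choice(blocks) for _ in range(draws))

# Share of grabs that landed on each shape; each should sit close to 25%.
shares = {shape: counts[shape] / draws for shape in blocks}
for shape, share in shares.items():
    print(f"{shape}: {share:.1%}")
```

No shape is favored, so each share hovers around one quarter and drifts closer to it as the number of grabs grows.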

This example with the children’s blocks uses simple random sampling, a special type of random sampling. Random sampling is a sampling technique where you select participants from a sampling frame with one core condition: every element or person has a known, non-zero chance of being selected, meaning everyone can actually be selected, contacted, or recruited for your research. Simple random sampling (abbreviated SRS) is when every element has the same, non-zero chance of being selected.

Random sampling and random assignment sometimes get mixed up, but they are two separate ideas: random sampling is about how participants are selected from a sampling frame, while random assignment is how participants are placed into a control or variant group for experiments.
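The difference is easier to see when the two steps sit side by side. This is a minimal sketch, assuming a hypothetical sampling frame of 200 numbered people and an experiment with two equally sized groups; none of these numbers come from the handbook.

```python
import random

rng = random.Random(7)  # arbitrary seed, used only for repeatability

# Hypothetical sampling frame of 200 people, identified by number.
sampling_frame = list(range(1, 201))

# Random sampling: decides WHO takes part in the study.
participants = rng.sample(sampling_frame, k=20)

# Random assignment: decides WHICH GROUP each participant lands in.
shuffled = participants[:]
rng.shuffle(shuffled)
control, variant = shuffled[:10], shuffled[10:]

print("control:", sorted(control))
print("variant:", sorted(variant))
```

The first step touches the sampling frame; the second only rearranges people who were already selected.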

Random sampling is possibly the most powerful sampling technique, for reasons covered later in this topic. But to use it properly in your research, you’ll have to meet certain requirements.


Random Sampling Requirements

Above, you read about the first requirement of random sampling: every person has a known probability of being selected. Additional requirements are listed below:

Random Sampling Requirements

Having a complete sampling frame (see Topic 2 in this handbook for more on sampling frames)
Having a mechanism, process, or tool that can actually select participants randomly (or without bias)
Every person in your sampling frame has a known, non-zero chance of being selected
All selections are independent of each other (aka selecting participant 1 doesn’t affect selecting participant 2)

You can take those requirements and visualize them like the grid above. This grid has a clear border, with each cell representing a different person. Every person has a known cell and choosing one cell won't affect how you select another. Keep this idea in your head because it'll help you make sense of other ideas in this phase.

What happens if you meet all of the requirements for random sampling? What makes random sampling powerful?


The Power of Random

When you use random sampling, you're essentially randomly picking numbers from a sampling frame. If you number everyone in a sampling frame from one onward and then use a random number generator to select a desired number of participants, that’s random sampling. The numbers were selected without bias (aka without your judgment or involvement). The final set of numbers stands a higher chance of being representative of your population or segment than if you selected people manually (assuming your sampling frame isn’t biased or inadequate). But beyond representativeness, the real power of random sampling comes when you need to make inferences back to your population.
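The numbering approach described above can be sketched in a few lines. The function name, the frame size of 1,000, and the sample size of 25 are all assumptions made for illustration.

```python
import random

def simple_random_sample(frame_size, sample_size, seed=None):
    """Pick `sample_size` distinct numbers from 1..frame_size, each equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so nobody is picked twice.
    return sorted(rng.sample(range(1, frame_size + 1), sample_size))

# Hypothetical frame of 1,000 numbered people; select 25 of them.
selected = simple_random_sample(frame_size=1000, sample_size=25, seed=1)
print(selected)
```

The generator decides which numbers come up, so your own judgment never enters the selection.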

Remember that inferences happen when you use your sample statistics as a way to estimate a population parameter. Random sampling allows you to make very precise estimates with only a fraction of your population or segment. For example, if you randomly sampled 400 people on what fruit they liked, you could estimate what a population of 100,000 people might prefer and, in theory, be off by only a few percentage points.

Numbers are taken from the Yamane sample size formula. See this article for more.

To illustrate how powerful random sampling can be, let’s expand on the numbers from above. Let’s look at two different populations: one from a small island and the other from a large city. The island has 500 residents while the city has 100,000. If you wanted to estimate what proportion of the small island likes a particular fruit within ±5%, you’d need 222 survey respondents. That’s about 44% of the entire island population.

But if you wanted to estimate the same for the large city within ±5%, then you’d need only about 400 respondents, or 0.4% of the entire city, for the same level of precision. Put another way, for a population two hundred times larger, you need only about 178 more respondents for the same amount of precision.

Once again, this example is assuming a perfect situation. But the logic beneath the numbers holds: random sampling allows you to use a smaller sample size without sacrificing precision.
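The island and city figures can be reproduced with the Yamane formula the text credits, usually written as n = N / (1 + N·e²), where N is the population size and e is the margin of error. The function name below is an assumption; the populations and the ±5% margin come from the example.

```python
def yamane_sample_size(population_size, margin_of_error):
    """Yamane's simplified formula: n = N / (1 + N * e^2)."""
    return population_size / (1 + population_size * margin_of_error ** 2)

for label, population in [("island", 500), ("city", 100_000)]:
    n = yamane_sample_size(population, 0.05)
    share = n / population
    print(f"{label}: N={population:,} -> n is about {n:.1f} ({share:.1%} of the population)")
```

The formula gives roughly 222 for the island and a little under 400 for the city. The required sample size flattens out as the population grows, which is why the share of the population you need drops so sharply.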

A lot of formulas in statistics also assume that you used random sampling (such as when creating confidence intervals or visualizing the sampling distribution of the mean). If you did in fact randomly sample, you could rely on a lot of mathematical assumptions to make better analytical interpretations. You can see other reasons to use random sampling below.
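As one example of a formula that leans on random sampling, the usual margin of error for a proportion is z·√(p(1−p)/n). The sketch below assumes a simple random sample from a large population, a 95% confidence level (z = 1.96), and the most conservative proportion (p = 0.5). These are standard textbook defaults, not values taken from this handbook.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

for n in (100, 385, 1000):
    print(f"n={n}: +/- {margin_of_error(n):.1%}")
```

If the sample was not selected randomly, the arithmetic still runs, but the result no longer describes how far your estimate is likely to be from the population value.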

When To Use Random Sampling

You’re running quantitative research
You want a defined or expected margin-of-error for your results
You want a representative sample
You want to generalize your sample findings back to your population

You can't do any better than a random sample when you think about resource and time optimization. In theory, if you can randomly sample, then you should. But in practice, random sampling can be hard to even try.


Random Sampling Isn't Easy

One of the most challenging requirements for random sampling is a complete sampling frame, and you probably don’t have access to a frame anywhere near complete. So couldn't you randomly sample from an incomplete list, like a list of customer emails the marketing team gave you? While this seems like a good approach, it won’t generate a random sample that is representative of your population or segment.

Random sampling only reflects the patterns and characteristics of the sampling frame you used. If you have a biased or skewed sampling frame, random sampling will give you a biased, unrepresentative sample.
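A small simulation makes this concrete. Everything here is hypothetical: a population split evenly on liking mangoes, and a skewed frame (say, a marketing email list) where 80% are mango fans.

```python
import random

rng = random.Random(3)  # arbitrary seed, used only for repeatability

# Hypothetical population: half like mangoes, half do not.
population_share = 0.50

# Hypothetical skewed frame: 800 mango fans and 200 non-fans (80% fans).
frame = ["mango fan"] * 800 + ["not a fan"] * 200

# A perfectly random sample, but drawn from the skewed frame.
sample = rng.sample(frame, k=200)
sample_share = sample.count("mango fan") / len(sample)

print(f"population: {population_share:.0%}, frame: 80%, sample: {sample_share:.0%}")
```

The sample lands near the frame's 80%, not the population's 50%. The selection was random; the list it drew from was not representative.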

Another issue with random sampling is when your sampling frame, which should reflect your population/segment-of-interest, is very diverse. Look at the diagram above. Imagine if you had limited time (which you probably do) and only one chance to sample from the frame to answer your research questions. If your sampling frame – which should be reflective of your population – isn’t diverse (like on the left), your sample should roughly capture that uniformity.

But if your sampling frame (and population or segment that frame represents) have a lot of small, minority groups, your random sample might never include anyone from these groups. When you’re dealing with a diverse segment, using a purposeful, non-random approach (covered more in the next topic) to make sure you hear from unique voices is a better approach.
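You can estimate how likely it is to miss a small group entirely. The sketch below treats each selection as an independent draw, which is a reasonable approximation when the frame is much larger than the sample. The 1% group share and the sample sizes are assumptions chosen for illustration.

```python
def chance_of_missing_group(group_share, sample_size):
    """Probability that none of `sample_size` independent draws hits the group."""
    return (1 - group_share) ** sample_size

for n in (50, 100, 400):
    p = chance_of_missing_group(group_share=0.01, sample_size=n)
    print(f"n={n}: {p:.1%} chance of sampling nobody from a 1% group")
```

With small samples, missing a 1% group is more likely than not. Even when the group does show up, one or two people rarely give you enough to say anything about it.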

Random sampling reflects or mimics the patterns and characteristics of the sampling frame you used.

Finally, just because you need 385 people to respond to estimate something with a +/- 5% margin-of-error, that doesn't mean you can get all 385 to respond. In fact, you might have to contact a thousand people before getting 385 informative participants in your sample. If enough people don't respond or participate, your sample will never reflect your population or segment, even if you’ve used random sampling.
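To plan for this, you can work backwards from the number of responses you need. The response rates below are hypothetical; use whatever your past studies suggest.

```python
import math

def invitations_needed(target_responses, response_rate):
    """Number of people to contact to expect `target_responses` completed responses."""
    return math.ceil(target_responses / response_rate)

for rate in (0.40, 0.25):
    print(f"response rate {rate:.0%}: contact {invitations_needed(385, rate)} people")
```

Contacting more people fixes the count, not the bias. If the people who respond differ from those who don't, the sample can still misrepresent your population.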

Does that mean you can't ever use random sampling? Not necessarily. Recognize that you're unlikely to stumble across a complete sampling frame. A complete sampling frame is a list that contains all the contact information for everyone in your population or segment. It doesn't exist, and never will, unless you try to create it.

If you know most of your population are bananas, then you'd want the same proportion of bananas in your sampling frame. Yes, this isn't perfect, but randomly sampling from this sampling frame will give you more valid results than impulsively using random sampling on whatever sampling frame(s) you have.

Random sampling is only as useful as the sampling frame you're selecting from. It doesn't automatically guarantee that your results will apply to your entire population or segment. You have to think critically about where you're sampling from and how well your frame represents the population or segment you care about.

You can also review government census data, product usage analytics, and other data sources that help you understand how your population or segment thinks, behaves, and feels. Over time, you can slowly build your sampling frame. It won't be perfect, but it'll increase the generalizability of your research.

With all the requirements, reasons, and warnings about random sampling out of the way, let's end this topic by looking at the different types of random sampling you could use.


Types of Random Sampling

Let's take a look at different random sampling techniques, alongside when you might want to use each one.

This is the basic assumption for random sampling: every person is in one specific, contained list or table (aka your sampling frame). Every cell represents an informative participant, and every cell can be selected without affecting any other cells. Use random sampling for your quantitative research whenever you can.

Simple random sampling (SRS) is the most basic, most powerful, and most challenging random sampling technique. In the grid above, everyone has a 1-in-25 chance of being selected (when using a random number generator to select participants that match specific numbers). The math around random sampling is built assuming someone takes an SRS approach. However, SRS can be challenging and too generic for your specific research needs. Let’s review the other forms of random sampling to find techniques that work better for your context.

Stratified random sampling takes the entire grid and then groups people based on common characteristics or traits. These groups are known as strata. Examples of strata could be all iOS owners, all tropical fruits, or everyone who likes a specific sports team. Once everyone is put into a particular stratum, you randomly select people from each stratum. Stratify on characteristics that are relevant to your research questions. Otherwise, your groups (and the data analysis and reporting that follow) will be meaningless. In the majority of situations, stratified random sampling will give you the best sample possible without missing out on data from smaller, marginalized groups.
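Here is a minimal sketch of stratified random sampling with proportionate allocation, where each stratum gets sample slots in proportion to its size. The function, the device-based strata, and the counts are all hypothetical, and the "at least one per stratum" rule is an added assumption so small groups are never skipped.

```python
import random

def stratified_sample(frame, stratum_of, total_sample_size, seed=None):
    """Group the frame into strata, then take a simple random sample within each."""
    rng = random.Random(seed)
    strata = {}
    for person in frame:
        strata.setdefault(stratum_of(person), []).append(person)

    sample = {}
    for name, members in strata.items():
        # Slots proportional to stratum size, but never fewer than one.
        # Rounding means the total can differ slightly from the request.
        k = max(1, round(total_sample_size * len(members) / len(frame)))
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical frame: 900 iOS owners, 80 Android owners, 20 "other".
frame = ([("iOS", i) for i in range(900)]
         + [("Android", i) for i in range(80)]
         + [("other", i) for i in range(20)])

sample = stratified_sample(frame, stratum_of=lambda p: p[0],
                           total_sample_size=50, seed=5)
for name, members in sample.items():
    print(name, len(members))
```

Unlike a plain simple random sample of 50, this guarantees that the smallest stratum appears in the sample. If you need enough people to analyze a small stratum on its own, you would give it more slots than its share of the frame.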

Cluster random sampling is like the approach above, but it's different for one main reason: it uses geography or natural groups. You cluster or group people based on their location. For example, you might cluster people based on their household or neighborhood when randomly sampling homes to test for water quality. Anyone in the house or the neighborhood already exists in a clearly defined cluster. Avoid cluster sampling if the clusters are very different from each other. Use this approach if one of your research questions cares about geography, proximity, or distance.
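The water quality example could be sketched as a one-stage cluster sample, where you randomly pick whole neighborhoods and then test every home inside them. The neighborhood and home names, the counts, and the choice of three clusters are assumptions for illustration.

```python
import random

rng = random.Random(11)  # arbitrary seed, used only for repeatability

# Hypothetical frame: 12 neighborhoods, each with 5 numbered homes.
neighborhoods = {
    f"neighborhood_{n}": [f"home_{n}_{h}" for h in range(1, 6)]
    for n in range(1, 13)
}

# Randomly select whole clusters (neighborhoods), not individual homes.
chosen = rng.sample(sorted(neighborhoods), k=3)

# One-stage cluster sample: every home in a chosen neighborhood is tested.
homes_to_test = [home for name in chosen for home in neighborhoods[name]]

print(chosen)
print(len(homes_to_test), "homes to test")
```

The random step happens at the cluster level, so you only need a list of neighborhoods up front, not a list of every home on the map.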


Sampling More Than Once

The sampling techniques above only have you sample once from your sampling frame. Let’s end this topic by looking at two techniques where you sample multiple times: multistage and multiphase random sampling. To examine both, let’s focus on an example research question: How fast does the public WIFI feel to residents on Fruitful Island?

Multistage random sampling is a more complex version of cluster and stratified random sampling. You take the location or population you care about, split it into meaningful groups, and randomly select a group to sample from again. Each time, you’re sampling from a smaller and smaller stage or number of possible selectable elements (or a smaller grid as shown in the diagram above). You repeat this process as many times as needed. As you keep repeating this process, the grids get more uniform. You collect data from the very last grid (known as a final sampling unit or FSU).

For the public WIFI example, a multistage random sampling technique could mean creating a list that places every fruit in a district, city, and neighborhood on the island (see above diagram). You then randomly select a district. From there, you randomly select one city from that district. Finally, you randomly select a neighborhood from that chosen city. You would collect data about the public WIFI from this final chosen neighborhood. Use multistage random sampling when you have a very large, very complex geographical location to study.
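The district, city, and neighborhood steps can be sketched as nested random choices. The island layout and every place name below are invented for illustration.

```python
import random

rng = random.Random(21)  # arbitrary seed, used only for repeatability

# Hypothetical island layout: district -> city -> neighborhoods.
island = {
    "North District": {
        "Port Mango": ["Docks", "Old Town"],
        "Kiwi Bay": ["Shore", "Hills"],
    },
    "South District": {
        "Fig Harbor": ["Market", "Pier"],
        "Plum Point": ["Ridge", "Grove"],
    },
}

# Stage 1: randomly select a district.
district = rng.choice(sorted(island))
# Stage 2: randomly select a city within that district.
city = rng.choice(sorted(island[district]))
# Stage 3: randomly select a neighborhood within that city.
neighborhood = rng.choice(island[district][city])

print(f"Collect data in: {district} > {city} > {neighborhood}")
```

Each stage narrows the pool before the next random selection, and data is collected only at the last stage.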

Multiphase random sampling is similar, but you collect data after every new random sample. You also decide who/what to randomly sample and when to keep sampling. With the WIFI example, you could randomly sample from all of the districts on the island and have the fruits take a survey about their public WIFI experience. You then randomly sample from the neighborhoods where the WIFI was rated poorly. Then, within each of these poor-WIFI neighborhoods, you randomly sample again to test the WIFI.
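A rough two-phase sketch of this idea follows. The frame, the 1 to 5 rating scale, the "2 or lower" cutoff for a poor rating, and the randomly generated ratings are all placeholders; in a real study the ratings would come from survey responses.

```python
import random

rng = random.Random(8)  # arbitrary seed, used only for repeatability

# Hypothetical frame: 20 neighborhoods with 30 residents each.
residents = [(f"neighborhood_{n}", f"resident_{n}_{r}")
             for n in range(1, 21) for r in range(30)]

# Phase 1: randomly sample residents and collect a survey rating (1 to 5).
phase_one = rng.sample(residents, k=120)
# Placeholder ratings standing in for real survey answers.
ratings = {person: rng.randint(1, 5) for person in phase_one}

# Use the phase 1 data to decide where to look next: neighborhoods
# with at least one poor rating (2 or lower).
poor_neighborhoods = sorted({hood for (hood, _), score in ratings.items() if score <= 2})

# Phase 2: randomly sample a few of those neighborhoods for a speed test.
phase_two = rng.sample(poor_neighborhoods, k=min(3, len(poor_neighborhoods)))

print("Test the connection in:", phase_two)
```

Data collected in the first phase determines what the second phase samples from, which is what separates this approach from multistage sampling.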

Multiphase sampling is more purposeful, cost-effective, and precise than multistage sampling. Use multiphase random sampling if you want a more flexible approach, have many variables to consider, and need (and have the time and resources to collect) more and more granular data.

If you need an in-depth reference to learn more about each of these random sampling techniques, check out this public chapter from the helpful book Sampling Essentials: Practical Guidelines for Making Sampling Choices by Johnnie N. Daniel, an American statistics professor and author.

Think of random sampling as an effective filter when planning your quantitative research: if you can't use random sampling, then you might want to consider approaches to lower the bias. But are there any reasons to use non-random sampling in your research?

Simple random sampling
Probability sampling
Stratified Sampling
Disproportionate allocation; proportionate allocation
Disproportionate allocation for between and within-strata analyses
