In my previous blog, I discussed three methods of dealing with big problems: sampling, aggregation, and supercomputing. In this blog, I’ll expand on sampling, and how it can (and should) be used to test the robustness of analyses.
Basic sampling takes a subset of the data (which itself may be the whole population, or only part of it). Analysing this sample gives you one estimate of the descriptive statistics, or one fitted model. But how good is that estimate? If we drew a different sample, would the answer be different? Probably. The question is by how much. If estimates from different samples vary substantially, they are likely not robust.
To get an idea of how robust results are, we can repeatedly resample the data, calculate our statistic or model of interest for each resample, and then summarise the results.
From a sample of n observations, we can resample in one of four ways:
- leave-one-out (jackknifing)
- leave-m-out, where a group of m observations (m < n) is left out at a time (jackknifing)
- separate into k sets, each with n/k observations (k-fold cross-validation)
- randomly resample, with replacement, to a total of n observations (bootstrapping)
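As a rough sketch, all four schemes can be generated with a few lines of Python. The toy data, the choice of m = 2 and k = 3, and the round-robin fold assignment are illustrative assumptions, not prescriptions:

```python
import random
from itertools import combinations

data = [2.1, 3.4, 1.8, 4.0, 2.9, 3.7]  # toy sample of n = 6 observations
n = len(data)

# Leave-one-out: n resamples, each omitting a single observation.
loo = [data[:i] + data[i + 1:] for i in range(n)]

# Leave-m-out: omit every possible group of m observations (m = 2 here).
m = 2
lmo = [[x for j, x in enumerate(data) if j not in omit]
       for omit in combinations(range(n), m)]

# k-fold: split into k sets of n/k observations each (k = 3 here).
# Assigned round-robin for simplicity; folds are usually drawn at random.
k = 3
folds = [data[i::k] for i in range(k)]

# Bootstrap: draw n observations with replacement (seeded for reproducibility).
rng = random.Random(42)
boot = [rng.choice(data) for _ in range(n)]
```

Note that leave-one-out and k-fold are deterministic, whereas the bootstrap resample changes with the random seed.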
Jackknifing and bootstrapping are commonly used methods for estimating the precision of sample statistics.
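For instance, a minimal bootstrap estimate of the standard error of the sample mean might look like the following (the function name, sample values, and number of resamples are all illustrative):

```python
import random
import statistics

def bootstrap_se(data, stat=statistics.mean, n_boot=2000, seed=0):
    """Estimate the standard error of `stat` by bootstrapping:
    resample with replacement, recompute the statistic each time,
    and take the standard deviation of those estimates."""
    rng = random.Random(seed)
    n = len(data)
    estimates = [stat([rng.choice(data) for _ in range(n)])
                 for _ in range(n_boot)]
    return statistics.stdev(estimates)

sample = [2.1, 3.4, 1.8, 4.0, 2.9, 3.7]
se = bootstrap_se(sample)  # spread of the resampled means
```

The spread of the 2,000 resampled means is exactly the "precision of the sample statistic" referred to above: a small standard error means the estimate would barely change under a different sample.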
For validating models, all of the above can be used, and they are often combined. Further, when developing models that will be used for prediction, fitting and evaluating a model on the same data can mask overfitting: the model performs well on the data used to parameterise it, but poorly on an independent data set. Model validation therefore often involves splitting the data into ‘training’ and ‘test’ subsets (typically 70% and 30% respectively), which are used to parameterise and validate the model. For example, a simple cross-validation of a model may repeatedly draw training and test sets from the data.
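The repeated 70/30 split described above can be sketched as follows. The straight-line model, the toy data, and the 100-split count are assumptions made for illustration:

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def cv_mse(xs, ys, n_splits=100, train_frac=0.7, seed=1):
    """Repeatedly split into 70% training / 30% test subsets,
    fit on the training set, and average the test-set MSE."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    mses = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        train, test = idx[:cut], idx[cut:]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mses.append(sum((ys[i] - (a + b * xs[i])) ** 2 for i in test)
                    / len(test))
    return sum(mses) / len(mses)

# Toy data: a straight line with alternating +/- 0.5 noise.
xs = list(range(20))
ys = [2 * x + 1 + 0.5 * (-1) ** x for x in xs]
err = cv_mse(xs, ys)  # average test-set mean squared error
```

Because the error is always measured on points the model never saw, a model that merely memorises its training data will score poorly here, which is exactly the overfitting check the split is designed to provide.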
From the above, it can be seen that resampling gives descriptive statistics about your estimate. With a small tweak, however, it can also be used for significance testing. A permutation test creates many copies of the data in which the labels on the data points are randomly shuffled; comparing your real result against this randomised distribution tells you how unusual it is.
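A minimal two-sample permutation test on the difference in means might look like this (the group values and the 5,000-shuffle count are illustrative assumptions):

```python
import random

def permutation_test(a, b, n_perm=5000, seed=0):
    """Two-sample permutation test on the difference in means.

    Pools both groups, repeatedly shuffles the group labels, and
    returns the proportion of shuffles whose absolute mean difference
    is at least as large as the observed one (an approximate p-value).
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly exchange the data-point labels
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(a) - sum(pb) / len(b)) >= observed:
            count += 1
    return count / n_perm

group_a = [5.1, 5.4, 4.9, 5.3, 5.0]
group_b = [6.2, 6.0, 6.5, 5.9, 6.3]
p = permutation_test(group_a, group_b)
```

If only a small fraction of the label-shuffled datasets produce a difference as large as the real one, the real difference is unlikely to have arisen from labelling alone.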
Determining which procedures you need will depend on:
- How much data you have. If you don’t have much, then bootstrapping is likely to be your preferred option. If you have plenty, then k-fold gives you subsamples that are relatively independent of one another.
- What you are trying to estimate, for example, significance testing versus model validation, as discussed above.
- How replicable you need your analysis to be. Bootstrapping relies on random resamples, so repeated runs give slightly different answers unless you fix the random seed. If you need your answer to be unequivocal (the same each time), you may want to look towards a jackknifing procedure.