In our landscape-scale research we commonly work with raster data, often at a resolution of 25 x 25 m grid cells. Across a whole landscape, that's a lot of data!
A problem arises when we try to analyse such data: it is simply too big for our little (or even not-so-little) computers to handle.
In this blog I’ll briefly discuss three options to make these problems more manageable: sampling, aggregation, and super-computing.
Sampling, simply put, is taking some of the observations and leaving out others. It can be really simple (e.g. random sampling), and it is easy to tweak the sample size from very small (say, while you are writing and testing your code) to larger sizes for the full analysis. Two key questions to ask are: 1) should I stratify my sample? and 2) how big should my sample be?
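As a minimal sketch of the idea (the raster here is a hypothetical one, flattened to a plain list of cell values), simple random sampling with a tunable sample size might look like:

```python
import random

def sample_cells(cells, n, seed=42):
    """Draw a simple random sample of n cells, without replacement."""
    rng = random.Random(seed)  # fixed seed so a re-run draws the same sample
    return rng.sample(cells, n)

# Stand-in for a flattened raster of 25x25m cell values
cells = list(range(1_000_000))

# Tiny sample while testing code, larger sample for the full analysis
test_sample = sample_cells(cells, 100)
full_sample = sample_cells(cells, 10_000)
```

Seeding the random generator is worth the extra line: it makes your sample reproducible, so re-running the script (or a colleague re-running it) analyses the same subset.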
Stratified random sampling can help to ensure a representative sample across groups in your data. What groups, you ask? Well, this is where you need to know your data. Explore your covariates with descriptive stats and histograms, or even do a simple cluster analysis to help visualise groups and major gradients in your data. You’ll want to stratify across these, randomly sampling within the groups. Remember, you’re aiming for a representative sample.
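To make stratification concrete, here is a small sketch using hypothetical cells tagged with a land-cover class as the stratum. With plain random sampling, a rare class (here, grass at 10% of cells) would get few samples; sampling within each stratum guarantees representation:

```python
import random
from collections import defaultdict

def stratified_sample(observations, stratum_of, n_per_stratum, seed=1):
    """Randomly sample n_per_stratum observations within each stratum."""
    rng = random.Random(seed)
    # Group observations by their stratum
    groups = defaultdict(list)
    for obs in observations:
        groups[stratum_of(obs)].append(obs)
    # Sample within each group (capped at group size)
    sample = []
    for members in groups.values():
        k = min(n_per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical cells: 90% forest, 10% grass
cells = [("forest", i) for i in range(900)] + [("grass", i) for i in range(100)]

# 50 cells from each class, rather than ~90 forest / ~10 grass
sample = stratified_sample(cells, stratum_of=lambda c: c[0], n_per_stratum=50)
```

The strata here are land-cover classes for illustration; in practice they would be whatever groups or gradients your descriptive stats and clustering revealed.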
The sample should be big enough, but not too big. How big depends on your data and what you are trying to do with it. If you are just getting your code working, as small as possible is usually fine. Past that stage, the sample size will depend on the variability in your data and possibly on the size of the effect you are seeking (think power analysis). You want a big enough (random) sample that a different sample would give you a similar answer (see Resampling for robustness), but not so big that you are forced onto a supercomputer unnecessarily, or have to wait weeks for a result when a few hours would give a similar answer. If you are doing a matching analysis, you will likely need around 10x as many ‘control’ samples as ‘treatment’ samples to ensure adequate matching can occur.
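The "a different sample gives a similar answer" check can itself be sketched in a few lines. Here, hypothetical covariate values are sampled repeatedly with different seeds; if the resulting estimates barely differ, the sample size is probably big enough:

```python
import random
import statistics

def mean_of_sample(values, n, seed):
    """Estimate the mean from one random sample of size n."""
    rng = random.Random(seed)
    return statistics.mean(rng.sample(values, n))

# Hypothetical covariate across the landscape (true mean ~10)
random.seed(0)
values = [random.gauss(10, 2) for _ in range(100_000)]

# Repeat the estimate with five different random samples
means = [mean_of_sample(values, 5_000, seed=s) for s in range(5)]

# If the spread between samples is small, the sample size is adequate
spread = max(means) - min(means)
```

If `spread` were large relative to the effect you care about, that would be a sign to increase the sample size before trusting any single draw.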
Rather than leaving out some samples, aggregation combines groups of observations into summary metrics at a coarser resolution. You might want to do this if you are interested in the summary metrics themselves, e.g. if you want to describe what is happening over a larger area. These areas may be larger grid cells, or polygons that better represent processes in your region.
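A tiny sketch of aggregation into larger grid cells, using a hypothetical 4x4 forest-cover raster averaged in 2x2 blocks:

```python
def aggregate_blocks(grid, factor):
    """Aggregate a square 2D grid into coarser cells by averaging
    factor x factor blocks of fine cells."""
    n = len(grid)
    coarse = []
    for i in range(0, n, factor):
        row = []
        for j in range(0, n, factor):
            block = [grid[a][b]
                     for a in range(i, i + factor)
                     for b in range(j, j + factor)]
            row.append(sum(block) / len(block))
        coarse.append(row)
    return coarse

# Hypothetical fine raster: 1 = forest, 0 = not forest
fine = [[1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1]]

# 2x2 coarse grid of forest fractions
coarse = aggregate_blocks(fine, 2)  # [[0.75, 0.0], [0.0, 1.0]]
```

Note how the binary cells become fractions in the coarser grid: the summary metric is exactly what you want if the larger area is your unit of interest.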
But problems can arise when you want to combine aggregated data with other layers, particularly at different or coarser resolutions. Rather than binary data (0/1), the new data is often a percentage (e.g. 51% forest cover), a mixed category (forest/grass), or simply no longer representative of the true nature of the cell (e.g. a cell that was 51% forest and 49% grass gets allocated “forest”). When you layer that on top of other data (say, soils), the coarser the data the harder it becomes to attribute forest conversion to a particular soil type. This non-binary data can be much more complex to analyse and interpret.
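The 51%/49% example can be made concrete with a short sketch (the class fractions here are hypothetical) of majority allocation and the information it discards:

```python
def majority_class(fractions):
    """Allocate a coarse cell to its single majority class,
    discarding the underlying mix of classes."""
    return max(fractions, key=fractions.get)

# A coarse cell that is a near-even mix of two classes
cell = {"forest": 0.51, "grass": 0.49}

label = majority_class(cell)  # "forest": the 49% grass disappears
```

Keeping the fractions themselves avoids this loss, but, as above, fractional layers are harder to cross-tabulate against other layers than clean binary ones.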
Big servers and the cloud can be used to crunch big numbers. This is sometimes necessary, for example with spatio-temporal data at landscape scales and other problems where the complexity really is irreducible. However, using these resources can involve a learning curve and more code, and you may have to wait your turn. It may be necessary for your final analysis, but you will want to be confident that your code works, and gives you the results you need, before heading down this avenue.