
Using rasterstats with multigeometries on very large datasets #214

@yyu1

Hi perrygeo,
I have used rasterstats' zonal_stats with some very large raster data (100 m resolution global data) and have run into issues. I was able to find the problem and made some modifications on my fork of rasterstats that seem to be working.

However, I haven't thought through whether it might break other intended functionality of rasterstats, so I wanted to open an issue here to discuss your thoughts and the intended usage for python-rasterstats.

Typically, when I want to do zonal statistics on very large rasters, I write my own C++ code using boost::accumulators. I think that is really the best way to calculate statistics on such large datasets, and it runs circles around Python in terms of speed. However, I thought I would give rasterstats a try since I have used it on some smaller datasets.

I ran into the following problem:
Running zonal_stats causes the Python process's memory usage to balloon until the kernel kills the process; Activity Monitor was showing memory usage above 100 GB.

Digging into the code further, I found that the problem is caused by MultiPolygon features in the vector file. It happens in this portion:

geom = shape(feat['geometry'])
if 'Point' in geom.type:
    geom = boxify_points(geom, rast)
geom_bounds = tuple(geom.bounds)        # bounding box of the entire (multi)geometry
fsrc = rast.read(bounds=geom_bounds)    # reads that whole window into memory

The problem is this: say I have a very large raster (mine is about 130 GB), and there is a MultiPolygon feature whose polygons are spread across almost the entirety of that raster. In that case geom.bounds gives a giant rectangle enclosing all of the polygons, even though the actual area covered by the polygons is not that big. The code then tries to read that entire window into memory, and later on it gets even worse when it builds nodata and NaN masks of the same size. This approach is never going to work with large rasters and MultiPolygons that are spread out.

The way I fixed this was to iterate through each polygon in a MultiPolygon, read the raster for that polygon into a numpy array, flatten that array to 1D, then move to the next polygon, do the same, and append its 1D array to the end. The flattening is what makes the append possible, and we don't really need the relative positions of the pixels in 2D space. The same is done for the nodata and NaN arrays. At the end, the existing code computes the statistics on the 1D array.

The questions now are:

  1. Does the flattening break other intended uses of zonal_stats?
  2. Does it even make sense to do this for python-rasterstats? Perhaps this use case is better suited for a C/C++ program.

Hope you can provide some feedback regarding this approach and your intended usage for python-rasterstats.
