-
Notifications
You must be signed in to change notification settings - Fork 115
Description
Hi, I had to do some optimizations recently for zonal stats on a very large number of polygons, and I thought the method might fit well here.
Essentially, instead of iterating thru each feature one at time, I use the rasterio.features.rasterize function to label each pixel of a raster with an id for all features. Then some array reshaping is used to create a sorted masked array, and all stats can then be generated in a vectorized fashion. It offers huge performance boosts when feature counts become large (see figure below). I setup just a few stats (['mean','max','min','sum','std']) to compare with the original zonal method just as a proof of concept.
Some downsides of this:
- It creates a copy of the original raster array, so will be more memory intensive with larger rasters.
- It will load a larger chunk of the raster, also using more memory than 1 feature at a time.
- it won't work with overlapping polygons as is.
- rasterizing all features at once means a pixel can only be assigned to a single feature, whereas before a pixel split by 2 polygons could be "shared". I suspect this is the cause in some slight differences in the final stats, but in my comparisons the differences are very tiny.
You can see the code here https://github.com/sdtaylor/python-rasterstats/tree/zonal_refactor_poc
And a testing/comparison script here below.
@perrygeo any interest in having this in rasterstats somehow?
Test code
import numpy as np
import rasterstats
import rasterio
from shapely.geometry import Point, MultiPoint, box, MultiPolygon
from shapely.ops import voronoi_diagram
import pandas as pd
from timeit import default_timer as timer
test_tif = './tests/data/slope.tif'
def generate_random_polygons(n, bounds, seed=10):
"""
Use random points and voronoi diagrams to create
numerous polygon shapes completely covering everything
within the bounds
Parameters
----------
n : int
number of polygons.
bounds : tuple
(minx, miny, maxx, maxy)
seed : numeric
random seed
Returns
-------
[Polygon]
List of polygons
"""
minx, miny, maxx, maxy = bounds
rng = np.random.default_rng(seed)
points_x = rng.uniform(minx, maxx, n)
points_y = rng.uniform(miny, maxy, n)
bounding_box = box(*bounds)
point_geom = MultiPoint([Point(x,y) for x,y in zip(points_x, points_y)])
geoms = [g.intersection(bounding_box) for g in voronoi_diagram(point_geom).geoms]
return geoms
with rasterio.open(test_tif) as src:
tif_profile = src.profile
tif_bounds = src.bounds
#----------------
# Time both methods
stats = ['mean','max','min','sum','std']
timing_results = []
comparison_results = {s:[] for s in stats}
for n_polygons in [10,50,100,500]:
print(f'testing with {n_polygons} polygons')
geoms = generate_random_polygons(n=n_polygons, bounds=tif_bounds, seed=None)
# Drop some geoms so the raster is not 100% covered
geoms = geoms[5:]
original_start = timer()
original_method = list(rasterstats.main.gen_zonal_stats(geoms, test_tif, stats=stats, all_touched=False))
original_end = timer()
original_total = round(original_end - original_start,5)
exp_start = timer()
exp_method = rasterstats.main.gen_zonal_stats_experment(geoms, test_tif, stats=stats, all_touched=False)
exp_end = timer()
exp_total = round(exp_end - exp_start, 5)
timing_results.append({
'n_polygons' : n_polygons,
'original_time' :original_total,
'experimental_time' : exp_total
})
# Compare the difference in final numbers between the original method
# and new method.
for i,o,e in zip(range(len(original_method)), original_method, exp_method):
for s in stats:
if o[s] is None and e[s] is None:
comparison_results[s].append(0)
else:
comparison_results[s].append(o[s] - e[s])
pd.DataFrame(timing_results).plot(x='n_polygons', y=['original_time','experimental_time'],ylabel='seconds', kind='bar')
for stat, differences in comparison_results.items():
print(f'{stat} largest abs difference: {np.abs(differences).max()}')
