Description
Describe the bug
My script is segfaulting, presumably on large data sets. Initially the script was being killed for running out of memory (OOM), so we increased the physical memory to 32GB and added a 10GB swap file. After that the script ran successfully for some data sources, but it is now segfaulting again. The TIFF currently being parsed is 4.38GB in size. There is only one polygon, because we are pulling stats for an entire farm field of approximately 82 acres.
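As a rough sanity check on the sizes involved, the memory needed to hold one band of the field's bounding box can be estimated from the field area and the pixel size. This is only a back-of-the-envelope sketch: the pixel sizes below are hypothetical (the actual resolution of the TIFF is not stated here), `estimate_window_bytes` is an illustrative helper, and it ignores the extra copies and masks that the statistics may need.

```python
ACRE_M2 = 4046.8564224  # square metres in one international acre


def estimate_window_bytes(area_acres, pixel_size_m, bytes_per_pixel=1, fill_ratio=1.0):
    """Rough bytes needed to hold one band covering the field's bounding box.

    fill_ratio is the share of the bounding box covered by the polygon
    (1.0 means the field is a perfect axis-aligned rectangle).
    """
    area_m2 = area_acres * ACRE_M2
    bbox_area_m2 = area_m2 / fill_ratio
    pixels = bbox_area_m2 / (pixel_size_m ** 2)
    return pixels * bytes_per_pixel


# Hypothetical pixel sizes, assuming 1 byte per pixel (e.g. uint8 data).
for size in (0.01, 0.05, 0.30):
    gib = estimate_window_bytes(82, size) / 2**30
    print(f"{size:.2f} m pixels: ~{gib:.3f} GiB per band")
```

If the real pixel size is near the small end of that range, a single band of the field is already several GiB before any masked copies are made, which would fit the earlier OOM kills.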
Machine Type: VPS
Memory Allocation: 32GB
SWAP File Allocation: 10GB
Operating System: Ubuntu 22.04.2
Python Version: 3.10.6
PIP Version: 22.0.2
Virtual Environment: Yes
GDAL Version: 3.4.3
rasterstats Version: 0.18.0
Other Packages:
- affine==2.4.0
- attrs==22.2.0
- boto3==1.26.88
- botocore==1.29.88
- certifi==2022.12.7
- click==8.1.3
- click-plugins==1.1.1
- cligj==0.7.2
- Fiona==1.8.22
- GDAL==3.4.3
- geopandas==0.12.2
- humanfriendly==10.0
- jmespath==1.0.1
- munch==2.5.0
- mysql-connector-python==8.0.32
- numpy==1.24.2
- packaging==23.0
- pandas==1.5.3
- protobuf==3.20.3
- pyparsing==3.0.9
- pyproj==3.4.1
- python-dateutil==2.8.2
- python-dotenv==1.0.0
- pytz==2022.7.1
- rasterio==1.3.6
- rasterstats==0.18.0
- s3transfer==0.6.0
- shapely==2.0.1
- simplejson==3.18.3
- six==1.16.0
- snuggs==1.4.7
- urllib3==1.26.14
zonal_stats(filepath_shapefile, filename_image, nodata=255, band=band["index"], stats=["min", "max", "mean", "count", "sum", "std", "median", "majority", "minority", "unique", "range", "nodata", "percentile_25", "percentile_75"])
Actual Error / Output
Fatal Python error: Segmentation fault
Current thread 0x00007f1b3793d000 (most recent call first):
File "/data/zonal/venv/lib/python3.10/site-packages/rasterstats/io.py", line 356 in read
File "/data/zonal/venv/lib/python3.10/site-packages/rasterstats/main.py", line 175 in gen_zonal_stats
File "/data/zonal/venv/lib/python3.10/site-packages/rasterstats/main.py", line 40 in zonal_stats
File "/data/zonal/main.py", line 287 in do_get_statistics
File "/data/zonal/main.py", line 130 in main
File "/data/zonal/main.py", line 354 in <module>
Extension modules: _mysql_connector, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pandas._libs.ops, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, pyproj._compat, pyproj._datadir, pyproj._network, pyproj._geod, pyproj.list, pyproj._crs, pyproj._transformer, pyproj.database, pyproj._sync, shapely.lib, shapely._geos, shapely._geometry_helpers, osgeo._gdal, osgeo._gdalconst, osgeo._ogr, osgeo._osr, fiona._err, fiona._geometry, fiona._shim, fiona._env, fiona.schema, fiona.ogrext, fiona._crs, rasterio._version, rasterio._err, rasterio._filepath, rasterio._env, rasterio._transform, rasterio._base, rasterio.crs, rasterio._features, rasterio._warp, rasterio._io, simplejson._speedups (total: 89)
Segmentation fault (core dumped)
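Because a segfault like the one above takes the whole interpreter down with it, one option while the root cause is unknown is to run each dataset in its own child process, so that the batch can log the failure and move on. The following is a minimal sketch using only the standard library, not a fix for the crash itself. `run_isolated` is a hypothetical helper, the code string passed to it stands in for the real per-dataset work, and the signal handling assumes a POSIX system such as the Ubuntu machine described above.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=None):
    """Run Python source in a child interpreter and report how it ended.

    Returns (ok, detail). On POSIX a negative return code means the child
    was killed by that signal number, for example -11 for SIGSEGV.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode == 0:
        return True, proc.stdout
    if proc.returncode < 0:
        try:
            name = signal.Signals(-proc.returncode).name
        except ValueError:
            # Signal number with no named constant on this platform.
            name = f"signal {-proc.returncode}"
        return False, f"killed by {name}"
    return False, f"exit code {proc.returncode}"


if __name__ == "__main__":
    # Stand-in for the real work; a crash in the child would not stop this loop.
    ok, detail = run_isolated("print('stats computed')")
    print(ok, detail.strip())
```

The parent process survives a crash in the child, so it can record which dataset failed and continue with the next one.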
The script downloads the files from our storage provider, reads the number of bands and their descriptions, then loops through the bands and computes the stats for each one. Every band uses the same shapefile, which was created earlier from a GeoJSON object using geopandas:
geojson = gpd.read_file(filepath_geojson)
geojson.set_crs("EPSG:4326")
crs = geojson.to_crs(3857)
crs.to_file(filename)
It is possible that the coordinate reference system (CRS) is incorrect; however, I have not had an issue with other datasets using this same code (different fields but same GeoJSON).
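One way to rule the CRS in or out is to compare the shapefile's CRS and bounds against the raster's before calling `zonal_stats`. The sketch below is plain Python and rests on some assumptions: the inputs would come from `GeoDataFrame.crs` and `GeoDataFrame.total_bounds` on the vector side and from the `crs` and `bounds` attributes of a dataset opened with `rasterio.open` on the raster side. `check_alignment` is a hypothetical helper, the string comparison of CRS values is a simplification, and the sample coordinates are made up. Separately, `set_crs` returns a new GeoDataFrame unless `inplace=True` is passed, so the bare `geojson.set_crs("EPSG:4326")` call above does not modify `geojson`.

```python
def bounds_overlap(a, b):
    """Return True if two (minx, miny, maxx, maxy) boxes intersect."""
    a_minx, a_miny, a_maxx, a_maxy = a
    b_minx, b_miny, b_maxx, b_maxy = b
    return (
        a_minx <= b_maxx and b_minx <= a_maxx
        and a_miny <= b_maxy and b_miny <= a_maxy
    )


def check_alignment(vector_crs, vector_bounds, raster_crs, raster_bounds):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    # Simplification: real CRS objects should be compared directly, not as strings.
    if str(vector_crs) != str(raster_crs):
        problems.append(f"CRS mismatch: vector {vector_crs} vs raster {raster_crs}")
    if not bounds_overlap(vector_bounds, raster_bounds):
        problems.append("vector bounds do not overlap raster bounds")
    return problems


# Made-up example: a field in Web Mercator metres vs a raster in a UTM zone.
issues = check_alignment(
    "EPSG:3857", (-10_500_000.0, 5_000_000.0, -10_499_400.0, 5_000_600.0),
    "EPSG:32615", (400_000.0, 4_400_000.0, 401_000.0, 4_401_000.0),
)
for issue in issues:
    print(issue)
```

If a check like this reported a mismatch for the failing dataset only, that would point at the input data rather than at memory.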
I am still new to Python, but I have 20 years of development experience in other languages, so I am not new to programming. If anyone can suggest where to go from here, I would appreciate it. I need this to work reliably because I have nearly 2000 datasets that need to be processed in our production environment.
Thank you.