Segmentation Fault (not necessarily caused by memory usage) #286

@adamleemiller

Describe the bug
My script is segfaulting, presumably on large datasets. Initially the script was being killed by the OOM killer because of memory consumption, so we increased the physical memory to 32GB and added a 10GB swap file. After that the script ran successfully for some data sources, but it is now segfaulting again. The TIFF currently being parsed is 4.38GB in size. There is only one polygon, because we are pulling stats for an entire farm field of approximately 82 acres.

Machine Type: VPS
Memory Allocation: 32GB
SWAP File Allocation: 10GB
Operating System: Ubuntu 22.04.2
Python Version: 3.10.6
PIP Version: 22.0.2
Virtual Environment: Yes
GDAL Version: 3.4.3
rasterstats Version: 0.18.0
Other Packages:
- affine==2.4.0
- attrs==22.2.0
- boto3==1.26.88
- botocore==1.29.88
- certifi==2022.12.7
- click==8.1.3
- click-plugins==1.1.1
- cligj==0.7.2
- Fiona==1.8.22
- GDAL==3.4.3
- geopandas==0.12.2
- humanfriendly==10.0
- jmespath==1.0.1
- munch==2.5.0
- mysql-connector-python==8.0.32
- numpy==1.24.2
- packaging==23.0
- pandas==1.5.3
- protobuf==3.20.3
- pyparsing==3.0.9
- pyproj==3.4.1
- python-dateutil==2.8.2
- python-dotenv==1.0.0
- pytz==2022.7.1
- rasterio==1.3.6
- rasterstats==0.18.0
- s3transfer==0.6.0
- shapely==2.0.1
- simplejson==3.18.3
- six==1.16.0
- snuggs==1.4.7
- urllib3==1.26.14

zonal_stats(filepath_shapefile, filename_image, nodata=255, band=band["index"], stats=["min", "max", "mean", "count", "sum", "std", "median", "majority", "minority", "unique", "range", "nodata", "percentile_25", "percentile_75"])

Actual Error / Output

Fatal Python error: Segmentation fault

Current thread 0x00007f1b3793d000 (most recent call first):
  File "/data/zonal/venv/lib/python3.10/site-packages/rasterstats/io.py", line 356 in read
  File "/data/zonal/venv/lib/python3.10/site-packages/rasterstats/main.py", line 175 in gen_zonal_stats
  File "/data/zonal/venv/lib/python3.10/site-packages/rasterstats/main.py", line 40 in zonal_stats
  File "/data/zonal/main.py", line 287 in do_get_statistics
  File "/data/zonal/main.py", line 130 in main
  File "/data/zonal/main.py", line 354 in <module>

Extension modules: _mysql_connector, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pandas._libs.ops, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, pyproj._compat, pyproj._datadir, pyproj._network, pyproj._geod, pyproj.list, pyproj._crs, pyproj._transformer, pyproj.database, pyproj._sync, shapely.lib, shapely._geos, shapely._geometry_helpers, osgeo._gdal, osgeo._gdalconst, osgeo._ogr, osgeo._osr, fiona._err, fiona._geometry, fiona._shim, fiona._env, fiona.schema, fiona.ogrext, fiona._crs, rasterio._version, rasterio._err, rasterio._filepath, rasterio._env, rasterio._transform, rasterio._base, rasterio.crs, rasterio._features, rasterio._warp, rasterio._io, simplejson._speedups (total: 89)
Segmentation fault (core dumped)

The script downloads the files from our storage provider, reads the number of bands and their descriptions, then loops through the bands and computes the stats for each one. Every band uses the same shapefile, which was created earlier from a GeoJSON object using geopandas:

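        # gpd is geopandas (import geopandas as gpd)
        geojson = gpd.read_file(filepath_geojson)
        # set_crs returns a new GeoDataFrame unless inplace=True, so keep the result
        geojson = geojson.set_crs("EPSG:4326")
        # reproject to EPSG:3857 and write the shapefile
        projected = geojson.to_crs(3857)
        projected.to_file(filename)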

It is possible that the coordinate reference system (CRS) is incorrect; however, I have not had an issue with other datasets using this same code (different fields but same GeoJSON).
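
One way to sanity-check the CRS question is to work out which block of pixels the polygon's bounding box would cover in the raster, and how large that block is. The sketch below is standard library only and is an illustration, not rasterstats code: `estimate_window` and `window_inside` are hypothetical helpers, the raster is assumed to be north-up, and every number in the example is made up. It rests on the assumption that rasterstats reads roughly the pixel window covering each feature's bounds, so a polygon in the wrong CRS would map to a window far off the raster or far larger than expected.

```python
import math


def estimate_window(bounds, origin_x, origin_y, pixel_w, pixel_h, itemsize=1):
    """Estimate the pixel window that covers `bounds` on a north-up raster.

    bounds: (minx, miny, maxx, maxy), assumed to be in the raster's CRS.
    origin_x, origin_y: coordinates of the raster's upper-left corner.
    pixel_w, pixel_h: pixel size in CRS units, both given as positive numbers.
    itemsize: bytes per pixel for one band (1 for uint8 data).
    """
    minx, miny, maxx, maxy = bounds
    col_off = math.floor((minx - origin_x) / pixel_w)
    col_end = math.ceil((maxx - origin_x) / pixel_w)
    # Rows count downward from the top edge, so maxy gives the first row.
    row_off = math.floor((origin_y - maxy) / pixel_h)
    row_end = math.ceil((origin_y - miny) / pixel_h)
    cols = col_end - col_off
    rows = row_end - row_off
    return {
        "col_off": col_off,
        "row_off": row_off,
        "cols": cols,
        "rows": rows,
        "bytes": rows * cols * itemsize,
    }


def window_inside(window, width, height):
    """Return True if the window lies entirely within a width x height raster."""
    return (
        window["col_off"] >= 0
        and window["row_off"] >= 0
        and window["col_off"] + window["cols"] <= width
        and window["row_off"] + window["rows"] <= height
    )


if __name__ == "__main__":
    # Made-up raster: upper-left corner at (1000, 5000), 0.5 unit pixels.
    good = estimate_window((1100.0, 4300.0, 1676.0, 4876.0), 1000.0, 5000.0, 0.5, 0.5)
    print(good, window_inside(good, 4000, 4000))

    # Same raster, but bounds left in degrees: the window falls off the raster.
    bad = estimate_window((-93.5, 42.0, -93.49, 42.01), 1000.0, 5000.0, 0.5, 0.5)
    print(bad, window_inside(bad, 4000, 4000))
```

In practice the bounds would come from the shapefile and the origin, pixel size, width, and height from the TIFF's metadata. A window that is negative, outside the raster, or many gigabytes in size would point at a CRS mismatch rather than at the machine's memory.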

I am still new to Python, but I have 20 years of development experience in other languages, so I am not new to programming. If anyone can suggest where to go from here, I would appreciate it. I need to get this working flawlessly because I have nearly 2000 datasets that need to be processed in our production environment.
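
Because a segfault takes down the whole interpreter, one possible stopgap for a large batch is to run each dataset in a child process and inspect its exit code, so that one crashing file does not stop the rest. This is a minimal sketch using only the standard library on a POSIX system; `run_isolated` and `describe` are hypothetical names, and the snippet passed in stands in for the real per-dataset work. It does not fix the underlying crash.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=None):
    """Run a Python snippet in a child interpreter and return its exit code.

    On POSIX, a negative return code -N means the child was killed by signal N.
    """
    result = subprocess.run([sys.executable, "-c", code], timeout=timeout)
    return result.returncode


def describe(returncode):
    """Turn a child exit code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"failed with exit code {returncode}"


if __name__ == "__main__":
    # Placeholder snippets; a real batch would launch the per-dataset job here.
    jobs = {
        "dataset_a": "print('stats computed')",
        "dataset_b": "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)",
    }
    for name, code in jobs.items():
        print(name, describe(run_isolated(code)))
```

The parent process survives the child's crash and can log which dataset failed, which also narrows down which files trigger the segfault.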

Thank you.
