-
Notifications
You must be signed in to change notification settings - Fork 459
Description
We have a variant with more than 50 alleles for almost 500000 samples. The line in question is over 3 gb in size.
bcftools and our own program that uses htslib can not read it.
After debugging I noticed, at least for .vcf instead of .vcf.gz, that hts_getline seemed to read it but it returned the length read which was over 3gb and since the returned value was captured in an int (4 bytes), (function _reader_fill_buffer in synced_bcf_reader.c)
it resulted in an overflow giving a negative number which was interpreted as failure.
Bigger datasets becoming more common in the future this problem will occur more often.
I´m sure this overflow problem can be found at more places.
I would consider it important to solve this and allow for bigger lines to handle the big data that we have today and even bigger in the near future. Changing to 8 byte integers would solve this. Note, size_t is used at many places, which handles this.