int overflow for big variant lines in vcf

We have a variant with more than 50 alleles across almost 500,000 samples. The line in question is over 3 GB in size.
Neither bcftools nor our own program, which uses htslib, can read it.
After debugging I noticed that, at least for plain .vcf (as opposed to .vcf.gz), hts_getline does seem to read the line, but it returns the length read, which was over 3 GB. That return value is captured in an int (4 bytes) in the function _reader_fill_buffer in synced_bcf_reader.c,
so it overflowed to a negative number, which was interpreted as a failure.
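
To illustrate the failure mode described above, here is a minimal sketch of what happens when a length above 2 GB is stored in a 4-byte signed integer. It does not call htslib: fake_line_length, store_in_int32, read_failed_narrow and read_failed_wide are hypothetical names made up for this example, and the only assumption carried over from the report is the convention that a negative return value means failure. The 32-bit store is modelled with explicit arithmetic so the result does not depend on the compiler.

```c
#include <stdint.h>

/* Hypothetical stand-in for a reader that returns the number of bytes read.
 * This is NOT hts_getline; it only reports a 3 GiB line length. */
static int64_t fake_line_length(void) {
    return INT64_C(3) * 1024 * 1024 * 1024; /* 3221225472 bytes */
}

/* Models storing a 64-bit length in a 4-byte two's complement int:
 * keep the low 32 bits, then reinterpret them as a signed value. */
static int32_t store_in_int32(int64_t len) {
    uint32_t low = (uint32_t)len; /* well-defined: reduction modulo 2^32 */
    if (low <= (uint32_t)INT32_MAX)
        return (int32_t)low;
    return (int32_t)((int64_t)low - INT64_C(4294967296));
}

/* Buggy pattern: the length is narrowed before the "negative means error" check. */
static int read_failed_narrow(int64_t len) {
    return store_in_int32(len) < 0;
}

/* Wide pattern: the length stays in a 64-bit type, so only a real error is negative. */
static int read_failed_wide(int64_t len) {
    return len < 0;
}
```

With the 3 GiB length, read_failed_narrow reports a failure although the read succeeded, while read_failed_wide does not. This is only a sketch of the symptom; the actual fix in htslib would have to cover every place where such a length is held in an int.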
As bigger datasets become more common, this problem will occur more often.
I'm sure the same overflow can be found in other places as well.
I consider it important to solve this and to allow longer lines, so that htslib can handle the data we have today and the even bigger data expected in the near future. Changing to 8-byte integers would solve this. Note that size_t is already used in many places, and it handles these lengths on 64-bit platforms.


int overflow for big variant lines in vcf #1539
