Add support for files larger than 2GB #289
Conversation
oschwald left a comment
This looks like a good start! I had a number of minor comments. In terms of performance, I believe the bounds checks on the methods should be unnecessary given their usage and eliminating them should help reduce the overhead a bit.
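As a hypothetical illustration of the kind of check in question (names are assumed, not the PR's actual code): if the reader always validates offsets against the database metadata before calling into the buffer, a check like this on every read is pure overhead on the hot path.

public byte get(long index) {
    if (index < 0 || index >= this.capacity) { // redundant when callers validate
        throw new IndexOutOfBoundsException("index: " + index);
    }
    // Locate the chunk holding this index, then read within it.
    ByteBuffer chunk = this.buffers[(int) (index / this.chunkSize)];
    return chunk.get((int) (index % this.chunkSize));
}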
Made the requested changes; this is the new benchmark. I'll keep working on the …
@oschwald should I update the script in https://github.com/maxmind/MaxMind-DB to generate a big DB too? I guess it would be useful for tests and benchmarks.
Thanks! I'll try to take a look soon.
I don't think we would want a large database in that repo. There are quite a few other projects pulling that in, and a large database would impact them. In terms of testing, it might make sense to focus on good unit-test coverage of …
Ah ok, I thought it was only used for your libraries.
Sounds good, I can cover both buffers like that. 👍
oschwald left a comment
I've only had a chance to do a cursory review, but I noticed a few things.
    throw new IllegalArgumentException("File channel has no data");
}

MultiBuffer buf = new MultiBuffer(size);
Won't this allocate a bunch of ByteBuffers that we will immediately replace with the mmap-backed ones? I think this problem exists several other places as well, e.g., duplicate.
Yeah, I added the private constructor that works with buffers after this and forgot to change it.
Isn't this still an issue? I'd expect something like this:
int fullChunks = (int) (size / DEFAULT_CHUNK_SIZE);
int remainder = (int) (size % DEFAULT_CHUNK_SIZE);
int totalChunks = fullChunks + (remainder > 0 ? 1 : 0);

// Map each chunk straight from the channel instead of allocating
// heap buffers that would immediately be replaced.
ByteBuffer[] buffers = new ByteBuffer[totalChunks];
long remaining = size;
for (int i = 0; i < totalChunks; i++) {
    long chunkPos = (long) i * DEFAULT_CHUNK_SIZE;
    long chunkSize = Math.min(DEFAULT_CHUNK_SIZE, remaining);
    buffers[i] = channel.map(
        FileChannel.MapMode.READ_ONLY,
        chunkPos,
        chunkSize
    );
    remaining -= chunkSize;
}
return new MultiBuffer(buffers, DEFAULT_CHUNK_SIZE);
I thought I saw this fixed last time, but either it was lost in the rebase or I overlooked it.
Think I missed it completely, fixed it now.
     throw new NullPointerException("Unable to use a NULL InputStream");
 }

-final int chunkSize = Integer.MAX_VALUE;
+final int chunkSize = Integer.MAX_VALUE / 2;
What was the motivation behind this change?
Mostly because of the changes in aafbae6. I made the same change in MultiBuffer too.
I was getting allocation errors trying to allocate byte[Integer.MAX_VALUE]; as far as I understand, this is because when allocating an array some memory is reserved for metadata.
I noticed that Integer.MAX_VALUE - 8 does the trick, at least on my machine, though I didn't know whether every platform would be fine with that, so I went with half of max int to be safe.
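For illustration, this is the JVM behavior being described (a sketch; exact limits vary by VM and heap settings):

// HotSpot stores an object header alongside the array data, so the largest
// allocatable array is slightly smaller than Integer.MAX_VALUE elements;
// the JDK itself uses Integer.MAX_VALUE - 8 as a "soft max" array length.
byte[] nearMax = new byte[Integer.MAX_VALUE - 8]; // typically fine, given enough heap
byte[] tooBig = new byte[Integer.MAX_VALUE];      // often fails with
                                                  // "OutOfMemoryError: Requested array size exceeds VM limit"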
Whatever threshold we use, we should probably use it for the decision to use a single buffer on lines 23 and 35 as well. Presumably the allocation there would have the same issue. From what I can tell, Integer.MAX_VALUE - 8 should be safe. We should probably just define this as a constant in the class.
Actually, you should just set DEFAULT_CHUNK_SIZE to this and then use that here.
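That is, something like this (a sketch of the suggested shape):

// Largest array length that common JVMs will reliably allocate.
static final int DEFAULT_CHUNK_SIZE = Integer.MAX_VALUE - 8;

// ...and at the call site:
final int chunkSize = DEFAULT_CHUNK_SIZE;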
@Test
public void testWrapValidChunks() {
    ByteBuffer[] chunks = new ByteBuffer[] {
        ByteBuffer.allocateDirect(MultiBuffer.DEFAULT_CHUNK_SIZE),
        ByteBuffer.allocateDirect(500)
    };

    MultiBuffer buffer = MultiBuffer.wrap(chunks);
    assertEquals(MultiBuffer.DEFAULT_CHUNK_SIZE + 500, buffer.capacity());
}

@Test
public void testWrapInvalidChunkSize() {
    ByteBuffer[] chunks = new ByteBuffer[] {
        ByteBuffer.allocateDirect(500),
        ByteBuffer.allocateDirect(MultiBuffer.DEFAULT_CHUNK_SIZE)
    };

    assertThrows(IllegalArgumentException.class, () -> MultiBuffer.wrap(chunks));
}
I guess these tests might be causing the failure? Quite strange, as the chunk size is not max int.
A possible solution I see is to move the chunk-size check from wrap to the constructor and test that using a small chunk size. At that point, wrap is just a one-liner. Sounds good?
I think all the tests allocating buffers of MultiBuffer.DEFAULT_CHUNK_SIZE will need to be adjusted as we are likely hitting the MaxDirectMemorySize limit set on the JVM. This also includes testDecodeStringTooLarge below, I believe.
Your approach for wrap makes sense.
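A sketch of that shape, assuming the private constructor performs the chunk-size validation:

public static MultiBuffer wrap(ByteBuffer[] chunks) {
    // All validation lives in the constructor, which tests can exercise
    // directly with a small chunk size instead of allocating huge buffers.
    return new MultiBuffer(chunks, DEFAULT_CHUNK_SIZE);
}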
oschwald left a comment
Sorry, I have been pretty busy, but here is some preliminary feedback.
I think I fixed everything you pointed out. The current test failures, though, have me quite stumped. I tried bumping the Surefire JVM heap size, but there are some conflicts with …
oschwald left a comment
The suggested test change may help with the CI, although I haven't gone through all the tests closely.
 }

-int readNode(ByteBuffer buffer, int nodeNumber, int index)
+int readNode(Buffer buffer, long nodeNumber, int index)
I think we should return long from this function. We will also need to update it to replace the static decodeInteger with a decodeLong. The issue is that, for 32-bit nodes, we could overflow.
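A sketch of the long-returning variant for the 32-bit record case (helper and field names are assumed, not the library's exact code):

long readNode(Buffer buffer, long nodeNumber, int index) {
    long baseOffset = nodeNumber * this.nodeByteSize;
    // A 32-bit record uses all four bytes for the record value, so a value
    // above Integer.MAX_VALUE would overflow an int; decoding into a long
    // avoids that.
    return decodeLong(buffer, baseOffset + index * 4L, 4);
}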
Force-pushed from 3a5f79a to da3cbac.
This reverts commit d6d7acd.
I rebased to increase the Surefire JVM heap size, but that doesn't seem to work. Though I managed to reproduce the issue locally with this:

$ docker run --rm -m 5g --memory-swap 5g \
    -v "$PWD":/ws -w /ws maven:3.9-eclipse-temurin-21 \
    bash -lc 'export MAVEN_OPTS="-Xms512m -Xmx1g"; mvn -B -e clean test \
        -Dsurefire.argLine="-Xms512m -Xmx1024m -XX:MaxMetaspaceSize=192m -XX:MaxDirectMemorySize=256m -XX:+ExitOnOutOfMemoryError" \
        -Dsurefire.forkCount=1'

The issue seems to be here when running …. Though then it fails in ….

I think I'll go with an approach similar to the one we used for the other tests, and create methods and a constructor to set the chunk size and test those.
I managed to make the tests pass: I added some protected methods, and the public ones are simple wrappers, like we'd done for other methods. I removed the …

All in all, I'm quite satisfied with the current state of the PR.
 @Test
 public void testNoIpV4SearchTreeStream() throws IOException {
-    this.testReader = new Reader(getStream("MaxMind-DB-no-ipv4-search-tree.mmdb"));
+    this.testReader = new Reader(getStream("MaxMind-DB-no-ipv4-search-tree.mmdb"), 2048);
It would be nice to parameterize the tests (and others) so that we are testing both SingleBuffer and MultiBuffer.
Also, we might get better test coverage of edge cases if we used a lower value for the MultiBuffer cases.
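For instance, with JUnit 5 the buffer size could become a test parameter, so each test runs against several configurations (the values and test body here are illustrative, not the suite's actual code):

@ParameterizedTest
@ValueSource(ints = {512, 2048, Integer.MAX_VALUE - 8})
public void testNoIpV4SearchTreeStream(int chunkSize) throws IOException {
    // A small chunkSize forces a MultiBuffer spanning several chunks and
    // exercises the chunk-boundary edge cases; the huge value effectively
    // yields a single-chunk buffer.
    this.testReader = new Reader(getStream("MaxMind-DB-no-ipv4-search-tree.mmdb"), chunkSize);
    // ... same assertions as the original test ...
}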
oschwald left a comment
Thanks for working through all of this!
Awesome, thanks for the reviews. 🙏
This is an attempt to make the library work with DBs larger than 2GB.

As discussed in #154, I started out creating a private Buffer interface that defines all the ByteBuffer methods used by the library, though using long instead of int where necessary. I also implemented it with SingleBuffer, which just wraps a single ByteBuffer and dispatches most method calls to that. I obviously had to make some changes to use Buffer and long where necessary.
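A minimal sketch of the shape being described, with an assumed method set (not the PR's exact interface):

import java.nio.ByteBuffer;

// Long-indexed analogue of the ByteBuffer methods the reader uses.
interface Buffer {
    long capacity();
    Buffer position(long newPosition);
    byte get(long index);
    Buffer get(byte[] dst);
    int getInt();
    long getLong();
    double getDouble();
    Buffer duplicate();
}

// SingleBuffer just forwards to one ByteBuffer, narrowing long to int.
final class SingleBuffer implements Buffer {
    private final ByteBuffer bb;

    SingleBuffer(ByteBuffer bb) { this.bb = bb; }

    public long capacity() { return bb.capacity(); }
    public Buffer position(long p) { bb.position((int) p); return this; }
    public byte get(long index) { return bb.get((int) index); }
    public Buffer get(byte[] dst) { bb.get(dst); return this; }
    public int getInt() { return bb.getInt(); }
    public long getLong() { return bb.getLong(); }
    public double getDouble() { return bb.getDouble(); }
    public Buffer duplicate() { return new SingleBuffer(bb.duplicate()); }
}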
These are the benchmarks before and after the change. There seems to be a small impact on performance; I would consider it negligible, but I'd love your feedback @oschwald. If this is good for you, I'll keep going with this approach.
Before:
After: