
Deduplicate Metadata #2

@nixpulvis


While toying with this data set, I noticed about a 4x redundancy (1.5MB vs. 384KB) in the metadata. It's probably not a big deal, but I figured I'd leave this script somewhere in case it becomes useful.

Current Format:

<id>: {
  "image_filepath": "images/<id>.jpg",
  "anomaly_class": <class>
},
...

Reduced Format:

<id>: <class>,
...

This changes the image filepath requirement to be consistent with the top-level key rather than the image_filepath key, which is already the case in the current dataset. The provided script will fail if this constraint is not satisfied.

require 'json'
require 'pathname'

def shrink(old_path)
  old = File.open(old_path) { |f| JSON.load(f) }
  new = old.map do |key, value|
    # Each top-level key must match the image's basename,
    # e.g. "123" => "images/123.jpg"; bail out otherwise.
    if key != File.basename(value["image_filepath"], ".jpg")
      raise "key (#{key}) / filepath (#{value["image_filepath"]}) mismatch"
    end
    [key, value["anomaly_class"]]
  end.to_h
  # Write alongside the original, e.g. module_metadata_new.json.
  new_path = Pathname(old_path).sub_ext("_new.json").to_s
  File.open(new_path, 'w') { |f| JSON.dump(new, f) }
end

shrink("./InfraredSolarModules/module_metadata.json")

Further reduction in size and random access time could be achieved by assuming a contiguous set of image paths and then using the offset into the metadata to index into them directly. This would avoid loading the entire set of metadata if it grows too large.
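For what it's worth, here's a rough sketch of that idea: pack the classes as fixed-width records (ordered by id, assuming ids are the contiguous integers 0...N), so a single lookup is one seek plus one short read instead of parsing the whole JSON file. The names (pack, lookup, RECORD_WIDTH) and the 16-byte width are just illustrative, not part of the dataset.

```ruby
require 'json'

# Widest anomaly class name we expect, padded with spaces (assumption).
RECORD_WIDTH = 16

# Convert the reduced format ({"<id>": "<class>", ...}) into a flat file
# of fixed-width records, one per id, in id order.
def pack(json_path, packed_path)
  metadata = File.open(json_path) { |f| JSON.load(f) }
  File.open(packed_path, 'w') do |f|
    # Assumes ids are exactly the contiguous integers 0...metadata.size.
    metadata.size.times do |id|
      f.write(metadata[id.to_s].ljust(RECORD_WIDTH))
    end
  end
end

# Read a single class without loading the rest of the metadata:
# seek straight to the record for this id and strip the padding.
def lookup(packed_path, id)
  File.open(packed_path) do |f|
    f.seek(id * RECORD_WIDTH)
    f.read(RECORD_WIDTH).rstrip
  end
end
```

Lookup is O(1) in the number of records, and the same offset doubles as the index into images/<id>.jpg.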
