
Deduplicate Metadata #2

@nixpulvis


While toying with this data set, I noticed about a 4x redundancy (1.5MB vs. 384KB) in the metadata. It's probably not a big deal, but I figured I'd leave this script somewhere in case it becomes useful.

Current Format:

<id>: {
  "image_filepath": "images/<id>.jpg",
  "anomaly_class": <class>
},
...

Reduced Format:

<id>: <class>,
...

This changes the image filepath requirement to be consistent with the top-level key rather than the image_filepath key, which is already the case in the current dataset. The provided script will fail if this constraint is not satisfied.

require 'json'
require 'pathname'

def shrink(old_path)
  old = File.open(old_path) { |f| JSON.load(f) }
  new = old.map do |key, value|
    # Each top-level key must match the image's basename,
    # e.g. "123" => "images/123.jpg"; bail out otherwise.
    if key != File.basename(value["image_filepath"], ".jpg")
      raise "key (#{key}) / filepath (#{value["image_filepath"]}) mismatch"
    end
    [key, value["anomaly_class"]]
  end.to_h
  # Write alongside the original, e.g. module_metadata_new.json.
  new_path = Pathname(old_path).sub_ext("_new.json").to_s
  File.open(new_path, 'w') { |f| JSON.dump(new, f) }
end

shrink("./InfraredSolarModules/module_metadata.json")

Further reduction in size and random access time could be achieved by assuming a contiguous set of image paths and then using the offset into the metadata to index into them directly. This would avoid loading the entire set of metadata if it grows too large.
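For what it's worth, here's a rough sketch of that idea: pack the classes as fixed-width records (ordered by id, assuming ids are the contiguous integers 0...N), so a single lookup is one seek plus one short read instead of parsing the whole JSON file. The names (pack, lookup, RECORD_WIDTH) and the 16-byte width are just illustrative, not part of the dataset.

```ruby
require 'json'

# Widest anomaly class name we expect, padded with spaces (assumption).
RECORD_WIDTH = 16

# Convert the reduced format ({"<id>": "<class>", ...}) into a flat file
# of fixed-width records, one per id, in id order.
def pack(json_path, packed_path)
  metadata = File.open(json_path) { |f| JSON.load(f) }
  File.open(packed_path, 'w') do |f|
    # Assumes ids are exactly the contiguous integers 0...metadata.size.
    metadata.size.times do |id|
      f.write(metadata[id.to_s].ljust(RECORD_WIDTH))
    end
  end
end

# Read a single class without loading the rest of the metadata:
# seek straight to the record for this id and strip the padding.
def lookup(packed_path, id)
  File.open(packed_path) do |f|
    f.seek(id * RECORD_WIDTH)
    f.read(RECORD_WIDTH).rstrip
  end
end
```

Lookup is O(1) in the number of records, and the same offset doubles as the index into images/<id>.jpg.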
