
How to speed up dataframe creation for a large dataset #546

@rvyas

Description

Hi,
I am creating a dataframe from 3.5m records with about 25 vectors, and it is taking over a minute.

# Source data: 3.5m records, each hash with the same ~25 keys.
data = [
  {m: 'abc', a: 1.2, b: 2.1, c: 2.3},
  {m: 'xyz', a: 1.1, b: 22.1, c: 223.3}
  ...
]

# Convert from array of hash to hash of array
vc = {}
data.first.keys.each do |ky|
  vc[ky] = data.map{|dt| dt[ky]}
end
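
# For reference, the conversion above scans `data` once per key (about 25 full
# passes over 3.5m hashes). The sketch below is an untested single-pass
# alternative that pre-allocates each column and fills it while walking the
# rows once; whether it is actually faster here would need benchmarking.
keys = data.first.keys
vc   = keys.each_with_object({}) { |k, h| h[k] = Array.new(data.length) }
data.each_with_index do |row, i|
  keys.each { |k| vc[k][i] = row[k] }
end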

require 'daru'
require 'benchmark'

Benchmark.bm do |x|
  x.report("df array_of_hash: ") { Daru::DataFrame.new(data, clone: false) }
  x.report("df hash_of_array: ") { Daru::DataFrame.new(vc, clone: false) }
end

##
#                              user     system      total        real
# df array_of_hash:   86.398855   0.311986  86.710841 ( 86.850770)
# df hash_of_array:   21.745897   0.027261  21.773158 ( 21.814447)

After converting the data (which itself took about a minute), it is a little faster, but 21 seconds is still a long time to create a dataframe.
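
One more avenue I have not measured (an assumption about Daru's internals, not a confirmed fast path): build the Daru::Vector columns up front and pass them in with clone: false, in the hope that the DataFrame can reuse them instead of copying.

# Hypothetical, unbenchmarked sketch: wrap each column array in a
# Daru::Vector and pass the hash of vectors to the DataFrame. With
# clone: false this may avoid an extra copy, but whether the copy is
# actually skipped depends on Daru's internals.
vectors = vc.each_with_object({}) { |(k, arr), h| h[k] = Daru::Vector.new(arr) }
df = Daru::DataFrame.new(vectors, clone: false)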

Any ideas how to speed this up?
