
How to speed up dataframe creation for a large dataset #546

@rvyas

Description

Hi,
I am creating a dataframe from 3.5m records with about 25 vectors, and it is taking over a minute.

# Source data: 3.5m records, each hash with the same ~25 keys.
data = [
  {m: 'abc', a: 1.2, b: 2.1, c: 2.3},
  {m: 'xyz', a: 1.1, b: 22.1, c: 223.3}
  ...
]

# Convert from array of hash to hash of array
vc = {}
data.first.keys.each do |ky|
  vc[ky] = data.map{|dt| dt[ky]}
end
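
# For reference, the conversion above scans `data` once per key (about 25 full
# passes over 3.5m hashes). The sketch below is an untested single-pass
# alternative that pre-allocates each column and fills it while walking the
# rows once; whether it is actually faster here would need benchmarking.
keys = data.first.keys
vc   = keys.each_with_object({}) { |k, h| h[k] = Array.new(data.length) }
data.each_with_index do |row, i|
  keys.each { |k| vc[k][i] = row[k] }
end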

require 'daru'
require 'benchmark'

Benchmark.bm do |x|
  x.report("df array_of_hash: ") { Daru::DataFrame.new(data, clone: false) }
  x.report("df hash_of_array: ") { Daru::DataFrame.new(vc, clone: false) }
end

##
#                              user     system      total        real
# df array_of_hash:   86.398855   0.311986  86.710841 ( 86.850770)
# df hash_of_array:   21.745897   0.027261  21.773158 ( 21.814447)

After converting the data (which itself took about a minute), it is a little faster, but 21 seconds is still a long time to create a dataframe.
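
One more avenue I have not measured (an assumption about Daru's internals, not a confirmed fast path): build the Daru::Vector columns up front and pass them in with clone: false, in the hope that the DataFrame can reuse them instead of copying.

# Hypothetical, unbenchmarked sketch: wrap each column array in a
# Daru::Vector and pass the hash of vectors to the DataFrame. With
# clone: false this may avoid an extra copy, but whether the copy is
# actually skipped depends on Daru's internals.
vectors = vc.each_with_object({}) { |(k, arr), h| h[k] = Daru::Vector.new(arr) }
df = Daru::DataFrame.new(vectors, clone: false)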

Any ideas how to speed this up?
