csvlink appears to hang during training #103

@ghost

Description

csvlink hangs after a few seconds, with CPU usage at 0.0%.

  • python version: 3.7.3
  • environment: centos

CSV Files to Match

$ wc -l train-*
   494 train-left.csv
   481 train-right.csv

Config file

Attempting to match on 8 fields.

{
 "field_names": [
  "state",
  "email",
  "address_2",
  "address_1",
  "county",
  "postal_code",
  "city",
  "name"
 ],
 "field_definition": [
  {
   "field": "state",
   "type": "String",
   "Has Missing": true
  },
  {
   "field": "email",
   "type": "String",
   "Has Missing": true
  },
  {
   "field": "address_2",
   "type": "String",
   "Has Missing": true
  },
  {
   "field": "address_1",
   "type": "String",
   "Has Missing": true
  },
  {
   "field": "county",
   "type": "String",
   "Has Missing": true
  },
  {
   "field": "postal_code",
   "type": "String",
   "Has Missing": true
  },
  {
   "field": "city",
   "type": "String",
   "Has Missing": true
  },
  {
   "field": "name",
   "type": "String",
   "Has Missing": true
  }
 ],
 "output_file": "deduped.csv",
 "skip_training": false,
 "training_file": false,
 "sample_size": 150000,
 "recall_weight": 2
}
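
As a sanity check on the config above, a minimal Python sketch like the following could count the entries in `field_names` and `field_definition` and report any name that has no definition. The helper name `summarize_config` and the abbreviated `sample` string are hypothetical and only for illustration; they are not part of csvlink.

```python
import json


def summarize_config(config_text):
    """Return (number of field_names, number of field_definition entries,
    list of names that have no matching definition)."""
    config = json.loads(config_text)
    names = config.get("field_names", [])
    defined = [d.get("field") for d in config.get("field_definition", [])]
    missing = [n for n in names if n not in defined]
    return len(names), len(defined), missing


# Abbreviated, hypothetical stand-in for config.json; in practice the text
# would be read from the real file, e.g. open("config.json").read().
sample = """
{
 "field_names": ["state", "email"],
 "field_definition": [
  {"field": "state", "type": "String", "Has Missing": true}
 ]
}
"""

print(summarize_config(sample))
```

Run against the full config, this makes it easy to confirm how many fields are actually being matched and that every field name has a definition.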

Command

Running csvlink with the following:

csvlink train-left.csv train-right.csv --config_file=config.json --inner_join

After an initial burst of heavy CPU use, the process settles into an idle state:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14191 somebody+  20   0  558660 143092  10144 S   0.0  0.9   0:52.45 csvlink
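
For reading the `top` row above: the `S` column holds the process state, and `S` there means interruptible sleep, so the process is waiting on something rather than spinning. A small Python sketch for pulling those columns out of a row follows; the helper names `parse_top_row` and `is_sleeping_idle` are hypothetical, and the column order is assumed to match the header shown above.

```python
def parse_top_row(row):
    """Split one row of top output into the columns used here.

    Assumes the default column order:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    """
    fields = row.split(None, 11)
    return {
        "pid": int(fields[0]),
        "state": fields[7],
        "cpu_percent": float(fields[8]),
        "time": fields[10],
        "command": fields[11],
    }


def is_sleeping_idle(info):
    """True when the process is in interruptible sleep and using no CPU."""
    return info["state"] == "S" and info["cpu_percent"] == 0.0


row = "14191 somebody+  20   0  558660 143092  10144 S   0.0  0.9   0:52.45 csvlink"
info = parse_top_row(row)
print(info["state"], info["cpu_percent"], is_sleeping_idle(info))
```

This only tells me that the process is asleep and idle; it does not say what it is waiting for.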

Am I doing something wrong?
