
Bug: tuning vision models raises TypeError: apply_tokenizer_chat_template() missing 1 required positional argument: 'formatted_text_column_name' #606

@VassilisVassiliadis

Description


Describe the bug

In fms-hf-tuning==3.0.0 the construction of handler_fn_kwargs1 in

# First data handler configuration
handler_fn_kwargs1 = {
    "dataset_text_field": data_args.dataset_text_field,
    "conversation_column": data_args.dataset_text_field,
}
handler_kwargs1 = {
    "fn_kwargs": handler_fn_kwargs1,
    "remove_columns": None,
}
handlers.append(
    DataHandlerConfig("apply_tokenizer_chat_template", arguments=handler_kwargs1)
)
# Second data handler configuration
handler_fn_kwargs2 = {
    "fields_name": {
        "dataset_text_field": data_args.dataset_text_field,
        "dataset_image_field": data_args.dataset_image_field,
    },
    "processor_kwargs": processor_kwargs,
}
breaks vision models.

The issue is that the apply_tokenizer_chat_template handler changed its argument names in fms-hf-tuning==3.0.0:

  1. from dataset_text_field to formatted_text_column_name
  2. from conversation_column to conversation_column_name

However, handler_fn_kwargs1 still uses the old names dataset_text_field and conversation_column, so the renamed formatted_text_column_name argument is never supplied. A sketch of a possible fix follows.
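A minimal sketch of the fix, assuming the two renames listed above are the only signature change (the new key names are taken from the renames and the error message, not verified against the 3.0.0 source):

# First data handler configuration, with the fn_kwargs keys renamed to match
# the new apply_tokenizer_chat_template argument names:
handler_fn_kwargs1 = {
    "formatted_text_column_name": data_args.dataset_text_field,
    "conversation_column_name": data_args.dataset_text_field,
}
handler_kwargs1 = {
    "fn_kwargs": handler_fn_kwargs1,
    "remove_columns": None,
}
handlers.append(
    DataHandlerConfig("apply_tokenizer_chat_template", arguments=handler_kwargs1)
)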

Platform


  • Interpreter version: Python 3.11
  • Library version: 3.0.0

Sample Code

A tuning job for a vision model, e.g.:

python sft_trainer.py \
    --log_level info \
    --eval_strategy no \
    --save_strategy no \
    --learning_rate 1e-05 \
    --weight_decay 0.0 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --include_tokens_per_second True \
    --packing False \
    --gradient_accumulation_steps 4 \
    --gradient_checkpointing True \
    --max_steps -1 \
    --num_train_epochs 10.0 \
    --stop_after_seconds -1.0 \
    --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
    --per_device_train_batch_size 4 \
    --torch_dtype bfloat16 \
    --max_seq_length 1024 \
    --training_data_path my_dataset.parquet \
    --output_dir output/raytune-1.0.2.dev21+8efd40a.dirty-nevergrad-2d9ac5/0-006c2aa3 \
    --use_flash_attn True \
    --optim adamw_torch \
    --bf16 False \
    --dataset_text_field output \
    --dataset_image_field images \
    --remove_unused_columns False \
    --dataset_kwargs '{"skip_prepare_dataset": true}' \
    --gradient_checkpointing_kwargs '{"use_reentrant": false}' \
    --peft_method none

Expected behavior

The job should complete successfully.

Observed behavior

Traceback (most recent call last):
  File "$site-packages//tuning/sft_trainer.py", line 360, in train
    ) = process_dataargs(
        ^^^^^^^^^^^^^^^^^
  File "$site-packages//tuning/data/setup_dataprocessor.py", line 564, in process_dataargs
    train_dataset, eval_dataset, dataset_text_field = _process_raw_data_args(
                                                      ^^^^^^^^^^^^^^^^^^^^^^^
  File "$site-packages//tuning/data/setup_dataprocessor.py", line 441, in _process_raw_data_args
    train_dataset, eval_dataset = data_processor.process_dataset_configs(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "$site-packages//tuning/data/data_processors.py", line 608, in process_dataset_configs
    train_dataset, eval_dataset = self._process_dataset_configs(dataset_configs)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "$site-packages//tuning/data/data_processors.py", line 499, in _process_dataset_configs
    raw_datasets = self._execute_data_handlers(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "$site-packages//tuning/data/data_processors.py", line 370, in _execute_data_handlers
    return self.__execute_map_data_handler(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "$site-packages//tuning/data/data_processors.py", line 314, in __execute_map_data_handler
    processed_ds[split_name] = ds.map(handler.op, **kwargs)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "$site-packages//datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "$site-packages//datasets/arrow_dataset.py", line 3171, in map
    for rank, done, content in iflatmap_unordered(
  File "$site-packages//datasets/utils/py_utils.py", line 728, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "$site-packages//datasets/utils/py_utils.py", line 728, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "$site-packages//multiprocess/pool.py", line 774, in get
    raise self._value
TypeError: apply_tokenizer_chat_template() missing 1 required positional argument: 'formatted_text_column_name'
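For illustration, a minimal standalone reproduction of the failure mode. The handler stub below is hypothetical and only mirrors the renamed signature described above; the point is that datasets' Dataset.map forwards fn_kwargs to the handler as keyword arguments, so when the handler absorbs unknown keys via **kwargs, stale key names surface as the missing-positional-argument TypeError seen in the traceback:

from datasets import Dataset

# Hypothetical stub mirroring the renamed 3.0.0 argument names; the real
# handler lives in fms-hf-tuning and does more work.
def apply_tokenizer_chat_template(
    example, formatted_text_column_name, conversation_column_name=None, **kwargs
):
    return example

ds = Dataset.from_dict({"output": ["hello"]})

# Passing the stale key names from handler_fn_kwargs1 reproduces the error:
# TypeError: apply_tokenizer_chat_template() missing 1 required positional
# argument: 'formatted_text_column_name'
ds.map(
    apply_tokenizer_chat_template,
    fn_kwargs={"dataset_text_field": "output", "conversation_column": "output"},
)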
