Describe the bug
In fms-hf-tuning==3.0.0, handler_fn_kwargs1 is constructed with outdated argument names in
fms-hf-tuning/tuning/data/setup_dataprocessor.py
(lines 291 to 311 in d8cb1cb):
    # First data handler configuration
    handler_fn_kwargs1 = {
        "dataset_text_field": data_args.dataset_text_field,
        "conversation_column": data_args.dataset_text_field,
    }
    handler_kwargs1 = {
        "fn_kwargs": handler_fn_kwargs1,
        "remove_columns": None,
    }
    handlers.append(
        DataHandlerConfig("apply_tokenizer_chat_template", arguments=handler_kwargs1)
    )
    # Second data handler configuration
    handler_fn_kwargs2 = {
        "fields_name": {
            "dataset_text_field": data_args.dataset_text_field,
            "dataset_image_field": data_args.dataset_image_field,
        },
        "processor_kwargs": processor_kwargs,
    }
The issue is that the apply_tokenizer_chat_template method changed its argument names in fms-hf-tuning==3.0.0:
- from dataset_text_field to formatted_text_column_name
- from conversation_column to conversation_column_name
However, handler_fn_kwargs1 still uses dataset_text_field and conversation_column.
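A possible fix, sketched under the assumption that only the two keyword names need to change and the values stay the same. The `data_args` object below is a hypothetical stand-in (a `SimpleNamespace` with just the one field used here), not the library's real argument class, and this has not been run against fms-hf-tuning itself.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the parsed data arguments; only the field used here.
data_args = SimpleNamespace(dataset_text_field="output")

# First data handler configuration, using the argument names that
# apply_tokenizer_chat_template expects in 3.0.0 according to this report.
handler_fn_kwargs1 = {
    "formatted_text_column_name": data_args.dataset_text_field,
    "conversation_column_name": data_args.dataset_text_field,
}
handler_kwargs1 = {
    "fn_kwargs": handler_fn_kwargs1,
    "remove_columns": None,
}

print(handler_kwargs1)
```

Whether both names should keep pointing at the same column is a question for the maintainers; the sketch only mirrors what the existing code does.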
Platform
Please provide details about the environment you are using, including the following:
- Interpreter version: Python 3.11
- Library version: 3.0.0
Sample Code
A tuning job for a vision model, e.g.:
['python', 'sft_trainer.py', '--log_level', 'info', '--eval_strategy', 'no', '--save_strategy', 'no', '--learning_rate', '1e-05', '--weight_decay', '0.0', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--include_tokens_per_second', 'True', '--packing', 'False', '--gradient_accumulation_steps', '4', '--gradient_checkpointing', 'True', '--max_steps', '-1', '--num_train_epochs', '10.0', '--stop_after_seconds', '-1.0', '--model_name_or_path', 'llava-hf/llava-v1.6-mistral-7b-hf', '--per_device_train_batch_size', '4', '--torch_dtype', 'bfloat16', '--max_seq_length', '1024', '--training_data_path', 'my_dataset.parquet', '--output_dir', 'output/raytune-1.0.2.dev21+8efd40a.dirty-nevergrad-2d9ac5/0-006c2aa3', '--use_flash_attn', 'True', '--optim', 'adamw_torch', '--bf16', 'False', '--dataset_text_field', 'output', '--dataset_image_field', 'images', '--remove_unused_columns', 'False', '--dataset_kwargs', '{"skip_prepare_dataset": true}', '--gradient_checkpointing_kwargs', '{"use_reentrant": false}', '--peft_method', 'none']
Expected behavior
The job should complete successfully.
Observed behavior
Traceback (most recent call last):
File "$site-packages//tuning/sft_trainer.py", line 360, in train
) = process_dataargs(
^^^^^^^^^^^^^^^^^
File "$site-packages//tuning/data/setup_dataprocessor.py", line 564, in process_dataargs
train_dataset, eval_dataset, dataset_text_field = _process_raw_data_args(
^^^^^^^^^^^^^^^^^^^^^^^
File "$site-packages//tuning/data/setup_dataprocessor.py", line 441, in _process_raw_data_args
train_dataset, eval_dataset = data_processor.process_dataset_configs(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "$site-packages//tuning/data/data_processors.py", line 608, in process_dataset_configs
train_dataset, eval_dataset = self._process_dataset_configs(dataset_configs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "$site-packages//tuning/data/data_processors.py", line 499, in _process_dataset_configs
raw_datasets = self._execute_data_handlers(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "$site-packages//tuning/data/data_processors.py", line 370, in _execute_data_handlers
return self.__execute_map_data_handler(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "$site-packages//tuning/data/data_processors.py", line 314, in __execute_map_data_handler
processed_ds[split_name] = ds.map(handler.op, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "$site-packages//datasets/arrow_dataset.py", line 557, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "$site-packages//datasets/arrow_dataset.py", line 3171, in map
for rank, done, content in iflatmap_unordered(
File "$site-packages//datasets/utils/py_utils.py", line 728, in iflatmap_unordered
[async_result.get(timeout=0.05) for async_result in async_results]
File "$site-packages//datasets/utils/py_utils.py", line 728, in <listcomp>
[async_result.get(timeout=0.05) for async_result in async_results]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "$site-packages//multiprocess/pool.py", line 774, in get
raise self._value
TypeError: apply_tokenizer_chat_template() missing 1 required positional argument: 'formatted_text_column_name'
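The error reports a missing argument rather than an unexpected keyword, which would be consistent with the handler accepting extra keyword arguments and silently swallowing the old names. The sketch below reproduces that pattern with a hypothetical stub; its signature is an assumption made for illustration and is NOT the library's real one. The helper `missing_required_kwargs` is likewise hypothetical and shows one way such a mismatch could be caught before calling `ds.map`.

```python
import inspect


def apply_tokenizer_chat_template_stub(
    element, formatted_text_column_name, conversation_column_name=None, **kwargs
):
    """Hypothetical stand-in with an ASSUMED signature (not the library's)."""
    return {formatted_text_column_name: str(element.get(conversation_column_name))}


def missing_required_kwargs(fn, fn_kwargs, positional=("element",)):
    """Return required parameter names of fn that fn_kwargs does not supply."""
    missing = []
    for name, param in inspect.signature(fn).parameters.items():
        if name in positional:
            continue  # supplied positionally by the map call
        if param.kind in (
            inspect.Parameter.VAR_POSITIONAL,
            inspect.Parameter.VAR_KEYWORD,
        ):
            continue  # *args / **kwargs are never "required"
        if param.default is inspect.Parameter.empty and name not in fn_kwargs:
            missing.append(name)
    return missing


# Old names (as in handler_fn_kwargs1) versus the renamed ones from this report.
old_kwargs = {"dataset_text_field": "output", "conversation_column": "output"}
new_kwargs = {
    "formatted_text_column_name": "output",
    "conversation_column_name": "output",
}

print(missing_required_kwargs(apply_tokenizer_chat_template_stub, old_kwargs))
print(missing_required_kwargs(apply_tokenizer_chat_template_stub, new_kwargs))

# Calling the stub with the old names raises the same kind of TypeError.
try:
    apply_tokenizer_chat_template_stub({"output": "hi"}, **old_kwargs)
except TypeError as err:
    print(err)
```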