genDPOdata

Attributes

seed

data_source

Functions

preprocess_data(examples[, max_words])

Preprocesses input examples to extract and truncate conversation components.

gen_mixed_preference_data(data_source, sample_size, split)

Generates a mixed dataset with samples from multiple preference data sources.

Module Contents

genDPOdata.seed = 6
genDPOdata.preprocess_data(examples, max_words=50)

Preprocesses input examples to extract and truncate conversation components.

Parameters:
  • examples (dict) – A batch of input examples containing ‘chosen’ and ‘rejected’ texts.

  • max_words (int, optional) – Maximum number of words to retain in each conversation component. Defaults to 50.

Returns:

Processed dictionary containing prompts, chosen responses, and rejected responses.

Return type:

dict
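The extraction and truncation described above can be illustrated with a minimal sketch. This is not the module's implementation: the helper names `truncate_words` and `preprocess_sketch`, the "\n\nAssistant:" turn marker, and the output keys "prompt", "chosen", and "rejected" are all assumptions made for illustration.

```python
def truncate_words(text, max_words=50):
    """Keep at most `max_words` whitespace-separated words of `text`."""
    return " ".join(text.split()[:max_words])


def preprocess_sketch(examples, max_words=50, marker="\n\nAssistant:"):
    """Hypothetical stand-in for preprocess_data; not the module's code."""
    out = {"prompt": [], "chosen": [], "rejected": []}
    for chosen, rejected in zip(examples["chosen"], examples["rejected"]):
        # Assumed layout: both texts share a prompt and differ only in the
        # final assistant turn, which follows the last occurrence of `marker`.
        prompt, _, chosen_reply = chosen.rpartition(marker)
        _, _, rejected_reply = rejected.rpartition(marker)
        out["prompt"].append(truncate_words(prompt, max_words))
        out["chosen"].append(truncate_words(chosen_reply, max_words))
        out["rejected"].append(truncate_words(rejected_reply, max_words))
    return out


batch = {
    "chosen": ["\n\nHuman: How do I boil an egg?\n\nAssistant: Simmer it for about ten minutes."],
    "rejected": ["\n\nHuman: How do I boil an egg?\n\nAssistant: I am not sure."],
}
processed = preprocess_sketch(batch, max_words=5)
```

Note that splitting on whitespace also collapses newlines and repeated spaces, so each truncated component comes back as a single line.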

genDPOdata.gen_mixed_preference_data(data_source, sample_size, split)

Generates a mixed dataset with samples from multiple preference data sources.

Parameters:
  • data_source (dict) – A dictionary mapping each source name to its weight, in the form {"harmless": p, "helpful": 1 - p}.

  • sample_size (int) – The total number of samples to generate.

  • split (str) – The dataset split to use, e.g., ‘train’ or ‘test’.

Returns:

A mixed dataset with samples from the specified data sources.

Return type:

Dataset

Example

>>> # Define the weight of each data source
>>> data_source = {"harmless": 0.5, "helpful": 0.5}
>>> sample_size = 2000
>>> ds_mix = gen_mixed_preference_data(data_source, sample_size, split="train")
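One plausible way to turn the weights in data_source into per-source sample counts is sketched below. The helper `allocate_counts` is hypothetical and is not part of genDPOdata. The sketch also assumes that the weights are normalized and that any rounding remainder goes to the last source; the actual function may allocate samples differently.

```python
def allocate_counts(data_source, sample_size):
    """Hypothetical helper: split `sample_size` across sources by weight."""
    total = sum(data_source.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    names = list(data_source)
    # Normalize each weight and round down to a whole number of samples.
    counts = {name: int(sample_size * data_source[name] / total) for name in names}
    # Give any rounding remainder to the last source so the counts add up exactly.
    counts[names[-1]] += sample_size - sum(counts.values())
    return counts


counts = allocate_counts({"harmless": 0.5, "helpful": 0.5}, 2000)
# counts == {"harmless": 1000, "helpful": 1000}
```

With equal weights and a sample size of 2000, each source contributes 1000 samples under this scheme.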

genDPOdata.data_source