genDPOdata
Attributes
- seed
- data_source
Functions
- preprocess_data(examples, max_words=50): Preprocesses input examples to extract and truncate conversation components.
- gen_mixed_preference_data(data_source, sample_size, split): Generates a mixed dataset with samples from multiple preference data sources.
Module Contents
- genDPOdata.seed = 6
- genDPOdata.preprocess_data(examples, max_words=50)
Preprocesses input examples to extract and truncate conversation components.
- Parameters:
examples (dict) – A batch of input examples containing ‘chosen’ and ‘rejected’ texts.
max_words (int, optional) – Maximum number of words to retain in each conversation component. Defaults to 50.
- Returns:
Processed dictionary containing prompts, chosen responses, and rejected responses.
- Return type:
dict
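The behaviour described above can be sketched roughly as follows. This is not the module's actual code: `sketch_preprocess`, `truncate_words`, the output keys `prompt`/`chosen`/`rejected`, and the `"\n\nAssistant:"` turn marker are all assumptions made for illustration.

```python
ASSISTANT_TAG = "\n\nAssistant:"  # assumption: turn marker used in the raw texts


def truncate_words(text, max_words):
    """Keep at most max_words whitespace-separated words."""
    return " ".join(text.split()[:max_words])


def sketch_preprocess(examples, max_words=50):
    """Hypothetical stand-in for preprocess_data (illustration only)."""
    out = {"prompt": [], "chosen": [], "rejected": []}
    for chosen, rejected in zip(examples["chosen"], examples["rejected"]):
        # Everything before the last assistant tag is treated as the prompt;
        # everything after it is the response.
        prompt, _, chosen_reply = chosen.rpartition(ASSISTANT_TAG)
        _, _, rejected_reply = rejected.rpartition(ASSISTANT_TAG)
        out["prompt"].append(truncate_words(prompt, max_words))
        out["chosen"].append(truncate_words(chosen_reply, max_words))
        out["rejected"].append(truncate_words(rejected_reply, max_words))
    return out


batch = {
    "chosen": ["\n\nHuman: Hi there\n\nAssistant: Hello, how can I help you today?"],
    "rejected": ["\n\nHuman: Hi there\n\nAssistant: Go away now please"],
}
result = sketch_preprocess(batch, max_words=3)
```

Truncating by whitespace-separated words keeps the sketch independent of any tokenizer; the real function may split and truncate differently.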
- genDPOdata.gen_mixed_preference_data(data_source, sample_size, split)
Generates a mixed dataset with samples from multiple preference data sources.
- Parameters:
data_source (dict) – A dictionary of source names and weights, in the form {"harmless": p, "helpful": 1 - p}.
sample_size (int) – The total sample size to generate.
split (str) – The dataset split to use, e.g., ‘train’ or ‘test’.
- Returns:
A mixed dataset with samples from specified data sources.
- Return type:
Dataset
Example
>>> # Define data source preferences
>>> data_source = {"harmless": 0.5, "helpful": 0.5}
>>> sample_size = 2000
>>> ds_mix = gen_mixed_preference_data(data_source, sample_size, split="train")
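As a rough illustration of weighted mixing, the sketch below draws a per-source quota from in-memory lists and shuffles the result. It is not the module's implementation: `sketch_mix` and the `pools` argument are hypothetical, plain lists stand in for the returned Dataset, and reusing the module-level seed value of 6 as the default is an assumption.

```python
import random


def sketch_mix(pools, data_source, sample_size, seed=6):
    """Hypothetical weighted mix over in-memory lists (illustration only)."""
    rng = random.Random(seed)
    mixed = []
    for name, weight in data_source.items():
        # Per-source quota; rounding means the total can differ slightly
        # from sample_size for some weight combinations.
        n = round(weight * sample_size)
        # Sample without replacement; requires n <= len(pools[name]).
        mixed.extend(rng.sample(pools[name], n))
    rng.shuffle(mixed)
    return mixed


pools = {
    "harmless": [("harmless", i) for i in range(10)],
    "helpful": [("helpful", i) for i in range(10)],
}
mix = sketch_mix(pools, {"harmless": 0.25, "helpful": 0.75}, sample_size=8)
```

With weights 0.25 and 0.75 and a sample size of 8, the sketch draws 2 records from the first pool and 6 from the second.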
- genDPOdata.data_source