genDPOdata

Attributes

`seed`
`data_source`

Functions

`preprocess_data`(examples[, max_words])	Preprocesses input examples to extract and truncate conversation components.
`gen_mixed_preference_data`(data_source, sample_size, split)	Generates a mixed dataset with samples from multiple preference data sources.

Module Contents

genDPOdata.seed = 6

genDPOdata.preprocess_data(examples, max_words=50)

Preprocesses input examples to extract and truncate conversation components.

Parameters:

examples (dict) – A batch of input examples containing ‘chosen’ and ‘rejected’ texts.
max_words (int, optional) – Maximum number of words to retain in each conversation component. Defaults to 50.

Returns:

Processed dictionary containing prompts, chosen responses, and rejected responses.

Return type:

dict
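As an illustration of the extract-and-truncate step, here is a minimal sketch, not the module's actual implementation. It assumes each 'chosen' and 'rejected' text is a transcript whose final reply follows a "\n\nAssistant:" marker, and that the output keys are named "prompt", "chosen", and "rejected". The marker, the key names, and the helpers `truncate_words` and `sketch_preprocess` are all hypothetical.

```python
def truncate_words(text, max_words=50):
    """Keep at most max_words whitespace-separated words (hypothetical helper)."""
    return " ".join(text.split()[:max_words])


def sketch_preprocess(examples, max_words=50):
    """Split each transcript into prompt and final reply, then truncate each part.

    Assumption: the last "\n\nAssistant:" marker separates the prompt from the
    response. This is an illustrative guess about the input format.
    """
    marker = "\n\nAssistant:"
    out = {"prompt": [], "chosen": [], "rejected": []}
    for chosen, rejected in zip(examples["chosen"], examples["rejected"]):
        # rpartition splits at the LAST marker; if the marker is absent,
        # the prompt is empty and the whole text is treated as the response.
        prompt, _, chosen_reply = chosen.rpartition(marker)
        _, _, rejected_reply = rejected.rpartition(marker)
        out["prompt"].append(truncate_words(prompt, max_words))
        out["chosen"].append(truncate_words(chosen_reply, max_words))
        out["rejected"].append(truncate_words(rejected_reply, max_words))
    return out


batch = {
    "chosen": ["\n\nHuman: Hi there\n\nAssistant: Hello, how can I help?"],
    "rejected": ["\n\nHuman: Hi there\n\nAssistant: Go away."],
}
processed = sketch_preprocess(batch, max_words=50)
```

Because the function takes and returns a dictionary of lists, a function of this shape could plausibly be applied batch-wise to a dataset, though how `preprocess_data` is actually invoked is not stated here.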

genDPOdata.gen_mixed_preference_data(data_source, sample_size, split)

Generates a mixed dataset with samples from multiple preference data sources.

Parameters:

data_source (dict) – A dictionary of source names and their sampling weights, in the format {“harmless”: p, “helpful”: 1-p}.
sample_size (int) – The total number of samples to generate.
split (str) – The dataset split to use, e.g., ‘train’ or ‘test’.

Returns:

A mixed dataset with samples from the specified data sources.

Return type:

Dataset

Example

>>> # Define the sampling weight for each data source
>>> data_source = {"harmless": 0.5, "helpful": 0.5}
>>> sample_size = 2000
>>> ds_mix = gen_mixed_preference_data(data_source, sample_size, split="train")
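The weighted mixing described for gen_mixed_preference_data can be sketched with plain Python lists. This is only an illustration under stated assumptions: the real function returns a Dataset and its sampling strategy is not documented here. The helper `sketch_mix`, the `pools` argument, and the choice to sample without replacement are hypothetical. The default seed of 6 mirrors the module-level `seed` attribute.

```python
import random


def sketch_mix(pools, data_source, sample_size, seed=6):
    """Draw sample_size items from several pools according to their weights.

    pools: dict mapping a source name to a list of examples (hypothetical input).
    data_source: dict mapping the same names to weights that sum to 1.
    """
    rng = random.Random(seed)
    names = list(data_source)
    # Per-source counts, truncated to integers.
    counts = {name: int(sample_size * data_source[name]) for name in names}
    # Give any rounding remainder to the last source so counts sum to sample_size.
    counts[names[-1]] += sample_size - sum(counts.values())
    mixed = []
    for name in names:
        # Sampling without replacement: each pool must hold at least counts[name] items.
        mixed.extend(rng.sample(pools[name], counts[name]))
    # Shuffle so the sources are interleaved rather than grouped.
    rng.shuffle(mixed)
    return mixed


pools = {
    "harmless": [("harmless", i) for i in range(100)],
    "helpful": [("helpful", i) for i in range(100)],
}
mixed = sketch_mix(pools, {"harmless": 0.25, "helpful": 0.75}, sample_size=40)
```

Fixing the seed makes the draw reproducible: calling `sketch_mix` twice with the same arguments yields the same mixed list.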

genDPOdata.data_source