trainPPO
Functions
- collator(data) – Data collator function for grouping data batches without padding.
- build_dataset(config, tokenizer, data_name) – Builds and tokenizes the dataset for training based on the specified data name.
- main(lam_list, value_list, model_name, data_name, save_path, learning_rate, batch_size, mini_batch_size, nepoch) – Main function to train a model with PPO (Proximal Policy Optimization) based on user-defined parameters.
Module Contents
- trainPPO.collator(data)
Data collator function for grouping data batches without padding. PPOTrainer handles padding internally based on the tokenizer settings.
- Parameters:
data (list) – List of data samples.
- Returns:
A dictionary with collated data grouped by each key in the input.
- Return type:
dict
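Based on the description above, the collator can be sketched as the usual TRL-style key-grouping collator (a minimal reconstruction; the actual source may differ):

```python
def collator(data):
    """Group a list of sample dicts into a single dict of lists.

    No padding is applied here: PPOTrainer pads internally according to
    the tokenizer settings, so the collator only regroups the batch by key.
    """
    return {key: [sample[key] for sample in data] for key in data[0]}
```

For example, `collator([{"query": "a"}, {"query": "b"}])` returns `{"query": ["a", "b"]}`.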
- trainPPO.build_dataset(config, tokenizer, data_name)
Builds and tokenizes the dataset for training based on the specified data name.
- Parameters:
config (PPOConfig) – Configuration for PPO training.
tokenizer (AutoTokenizer) – Tokenizer used to process and encode text data.
data_name (str) – Name of the dataset, either “Imdb” or “Anthropic-harmless”.
- Returns:
A Hugging Face Dataset object with tokenized prompts for training.
- Return type:
Dataset
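The core of build_dataset can be sketched as follows. This is a hypothetical reconstruction following the common TRL pattern: each raw text is tokenized, truncated to a short prompt, and stored with both its token ids (`input_ids`) and its decoded text (`query`), the two fields PPOTrainer expects per sample. `build_prompts` and `ToyTokenizer` are illustrative names; the toy tokenizer merely stands in for the real `AutoTokenizer` interface.

```python
def build_prompts(tokenizer, raw_texts, max_prompt_tokens=8):
    """Hypothetical sketch of build_dataset's tokenization step: truncate
    each text to a short prompt and keep token ids plus decoded text."""
    samples = []
    for text in raw_texts:
        input_ids = tokenizer.encode(text)[:max_prompt_tokens]
        samples.append({"input_ids": input_ids,
                        "query": tokenizer.decode(input_ids)})
    return samples

class ToyTokenizer:
    """Stand-in exposing the encode()/decode() interface assumed above."""
    def encode(self, text):
        return text.split()
    def decode(self, token_ids):
        return " ".join(token_ids)
```

In trainPPO.py the tokenizer would be a Hugging Face AutoTokenizer, the raw texts would come from the “Imdb” or “Anthropic-harmless” data, and the resulting samples would typically be wrapped in a Hugging Face Dataset (e.g. via `Dataset.from_list`) before being handed to PPO training.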
- trainPPO.main(lam_list, value_list, model_name, data_name, save_path, learning_rate=1e-06, batch_size=20, mini_batch_size=2, nepoch=1)
Main function to train a model with PPO (Proximal Policy Optimization) based on user-defined parameters.
- Parameters:
lam_list (list of float) – Lambda weights for the values being aligned, one per value in value_list.
value_list (str) – Comma-separated string of the values to align, or “all” for all supported values.
model_name (str) – Name of the model to use (e.g., “opt-1.3b”).
data_name (str) – Name of the dataset to use (“Imdb” or “Anthropic-harmless”).
save_path (str) – Path to save the trained model.
learning_rate (float) – Learning rate for PPO training. Defaults to 1e-6.
batch_size (int) – Total batch size for training. Defaults to 20.
mini_batch_size (int) – Mini-batch size for each PPO optimization step. Defaults to 2.
nepoch (int) – Number of training epochs. Defaults to 1.
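How lam_list enters training is not spelled out here. A natural reading, consistent with supplying one weight per aligned value, is that the per-value reward scores are combined into a single scalar PPO reward as a lambda-weighted sum. The sketch below is an assumption (the hypothetical `combined_reward` does not appear in the source, and the actual combination in trainPPO.py may differ):

```python
def combined_reward(value_scores, lam_list):
    """Hypothetical: collapse per-value reward scores into one scalar PPO
    reward via a lambda-weighted sum. Assumes one lambda per value score."""
    if len(value_scores) != len(lam_list):
        raise ValueError("need exactly one lambda per value score")
    return sum(lam * score for lam, score in zip(lam_list, value_scores))
```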
- Example command-line usage:
$ python trainPPO.py --model_name="opt-1.3b" --data_name="Imdb" --value_list="all" --lam_list="0.241,0.077,0.117,0.033,0.070,0.065" --learning_rate=1e-4