plot_cal_winrate

Attributes

base_model_file

Functions

calculate_win_rate(→ dict)

Calculate win rates of model performance compared to a base model across specified metrics.

collect_multiple_results(→ list[dict])

Aggregate win rate results for multiple models compared to a base model and save to a JSON file.

render_latex_table(→ str)

Generate and save a LaTeX table for win rates from a list of results.

plot_helpful_vs_harmless(→ None)

Plot and save a line graph of helpful and harmless win rates vs. harmless ratios.

plot_winrate(→ None)

Generate a scatter plot comparing win rates for various models based on helpfulness and harmlessness.

plot_cLevels()

Generate a scatter plot to compare average rewards (c-level) across various model baselines.

Module Contents

plot_cal_winrate.calculate_win_rate(model_file: str, base_model_file: str, metrics: list[str] = ['perplexity', 'coherence', 'diversity', 'gpt2-harmless', 'gpt2-helpful', 'humor']) dict

Calculate win rates of model performance compared to a base model across specified metrics.

Opens the JSON files for both models, computes the win rate for each metric, and computes the standard error of each win rate.

Parameters:
  • model_file (str) – Path to the JSON file of the fine-tuned model’s generated continuations.

  • base_model_file (str) – Path to the JSON file of the base model’s generated continuations.

  • metrics (list[str], optional) – List of human values/metrics for comparison. Defaults to a standard list.

Returns:

Contains file paths, win rates for each metric, and standard errors for each metric.

Return type:

dict

Example

>>> result = calculate_win_rate("fine_tuned_model.json", "base_model.json")
>>> print(result)
Command-line usage:

$ python script.py --model_file="fine_tuned_model.json" --base_model_file="base_model.json"
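
The scoring details live in the module itself; as a rough illustration of the statistics involved, a per-metric win rate can be taken as the fraction of prompts on which the fine-tuned model beats the base model, with a binomial standard error. The sketch below assumes a hypothetical JSON layout (a list of per-prompt records keyed by metric name), which may differ from the actual file format:

    import json
    import math

    def sketch_win_rate(model_file, base_model_file, metric):
        # Assumed layout: each JSON file holds a list of per-prompt records keyed
        # by metric name, e.g. [{"gpt2-helpful": 0.54, ...}, ...]; adjust to the
        # real schema produced by the evaluation scripts.
        with open(model_file) as f:
            model_scores = [entry[metric] for entry in json.load(f)]
        with open(base_model_file) as f:
            base_scores = [entry[metric] for entry in json.load(f)]

        # Win rate: fraction of prompts on which the fine-tuned model scores higher.
        n = len(model_scores)
        wins = sum(m > b for m, b in zip(model_scores, base_scores))
        p = wins / n

        # Binomial standard error of the estimated win rate.
        se = math.sqrt(p * (1 - p) / n)
        return {metric: round(p, 2), metric + "_SE": round(se, 2)}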
plot_cal_winrate.collect_multiple_results(model_files: list[str], base_model_file: str, file_prefix: str, metrics: list[str] = None) list[dict]

Aggregate win rate results for multiple models compared to a base model and save to a JSON file.

Iterates over multiple model files, calculates win rates for each using calculate_win_rate, and saves the aggregate results as JSON.

Parameters:
  • model_files (list[str]) – List of file paths for the fine-tuned model JSON files.

  • base_model_file (str) – Path to the JSON file for the base model.

  • file_prefix (str) – Prefix for the output JSON file name.

  • metrics (list[str], optional) – Metrics for win rate calculation. Defaults to None.

Returns:

List of win rate results for each model file.

Return type:

list[dict]

Example

>>> base_model_file = 'results/opt1.3b-Anthropic-harmless.json'
>>> harmless_ratios = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
>>> beta = 0.1
>>> file_prefix = f"results_comparison/winrate_{beta}beta_DPO"
>>> model_files = [f'modelsDPO/opt1.3b-2000sample-{beta}beta-{ratio}harmless-Anthropic-harmless.json' for ratio in harmless_ratios]
>>> win_rate_results_list = collect_multiple_results(model_files, base_model_file, file_prefix)
>>> print(win_rate_results_list)

It will also create a file named {file_prefix}.json that contains a list of entries like this:

    [
      {
        "model-path": "modelsDPO/soup/opt1.3b-2000sample-0.5beta-0.1soup-Anthropic-harmless.json",
        "basemodel-path": "results/opt1.3b-Anthropic-harmless.json",
        "perplexity": "0.70",
        "coherence": "0.48",
        "diversity": "0.48",
        "gpt2-harmless": "0.62",
        "gpt2-helpful": "0.54",
        "humor": "0.21",
        "perplexity_SE": "0.01",
        "coherence_SE": "0.01",
        "diversity_SE": "0.01",
        "gpt2-harmless_SE": "0.01",
        "gpt2-helpful_SE": "0.01",
        "humor_SE": "0.01"
      }
    ]
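
A minimal sketch of the aggregation step, assuming calculate_win_rate from this module is in scope and that the output path is simply the prefix with a .json suffix appended:

    import json

    def sketch_collect(model_files, base_model_file, file_prefix, metrics=None):
        results = []
        for model_file in model_files:
            # Compare each fine-tuned checkpoint against the same base model.
            if metrics is None:
                results.append(calculate_win_rate(model_file, base_model_file))
            else:
                results.append(calculate_win_rate(model_file, base_model_file, metrics))

        # Persist the aggregate list so the plotting functions can reload it later.
        with open(file_prefix + ".json", "w") as f:
            json.dump(results, f, indent=2)
        return results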

plot_cal_winrate.render_latex_table(win_rate_results_list: list[dict], file_prefix: str) str

Generate and save a LaTeX table for win rates from a list of results.

Constructs a LaTeX table summarizing win rates across models and metrics. Saves the table to a .tex file for LaTeX compilation.

Parameters:
  • win_rate_results_list (list[dict]) – List of dictionaries containing win rate results.

  • file_prefix (str) – Prefix for the output LaTeX file name.

Returns:

LaTeX-formatted table as a string.

Return type:

str
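
The exact column order and formatting are defined in the module; the sketch below shows one plausible way to assemble such a table, assuming the string-valued win rates from the JSON example above and a hypothetical subset of metrics:

    def sketch_latex_table(win_rate_results_list, file_prefix,
                           metrics=("gpt2-harmless", "gpt2-helpful", "humor")):
        header = " & ".join(["Model"] + list(metrics)) + r" \\"
        rows = []
        for result in win_rate_results_list:
            cells = [result["model-path"]]
            for metric in metrics:
                # Report each win rate together with its standard error
                # (values are stored as strings in the results JSON).
                cells.append(result[metric] + " $\\pm$ " + result[metric + "_SE"])
            rows.append(" & ".join(cells) + r" \\")

        lines = [r"\begin{tabular}{l" + "c" * len(metrics) + "}", r"\hline",
                 header, r"\hline"] + rows + [r"\hline", r"\end{tabular}"]
        table = "\n".join(lines)
        with open(file_prefix + ".tex", "w") as f:
            f.write(table)
        return table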

plot_cal_winrate.plot_helpful_vs_harmless(win_rate_results_list: list[dict], harmless_ratios: list[float], file_prefix: str) None

Plot and save a line graph of helpful and harmless win rates vs. harmless ratios.

Creates a plot comparing helpfulness and harmlessness win rates as a function of different harmless ratios. Saves the plot as a PDF file.

Parameters:
  • win_rate_results_list (list[dict]) – List of dictionaries containing win rate results.

  • harmless_ratios (list[float]) – List of harmlessness ratio values to plot.

  • file_prefix (str) – Prefix for the output PDF file name.
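
A minimal matplotlib sketch of this plot, assuming the string-valued win rates from the JSON example above and an output path of {file_prefix}.pdf (the styling used by the module may differ):

    import matplotlib.pyplot as plt

    def sketch_helpful_vs_harmless(win_rate_results_list, harmless_ratios, file_prefix):
        # Win rates are stored as strings in the results JSON, so cast to float.
        helpful = [float(r["gpt2-helpful"]) for r in win_rate_results_list]
        harmless = [float(r["gpt2-harmless"]) for r in win_rate_results_list]

        fig, ax = plt.subplots()
        ax.plot(harmless_ratios, helpful, marker="o", label="gpt2-helpful win rate")
        ax.plot(harmless_ratios, harmless, marker="s", label="gpt2-harmless win rate")
        ax.set_xlabel("harmless ratio")
        ax.set_ylabel("win rate vs. base model")
        ax.legend()
        fig.savefig(file_prefix + ".pdf", bbox_inches="tight")
        plt.close(fig)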

plot_cal_winrate.plot_winrate() None

Generate a scatter plot comparing win rates for various models based on helpfulness and harmlessness.

This function draws a 2D scatter plot in which each model entry is a point, with “gpt2-helpful” win rates on the x-axis and “gpt2-harmless” win rates on the y-axis. Each baseline model is given a unique marker and color, and the function computes and displays the navigation efficiency of each baseline (the proportion of its points in the “upper right” quadrant).

A reference point for the original model is plotted, along with gridlines and shading to highlight the upper-right region, which represents favorable scores for both helpfulness and harmlessness.

Specifically, we run __main__ to generate the result files for the following baselines and for our method (MAP):

  • "DPO(0.1)": results_comparison/winrate_0.1beta_DPO.json

  • "DPO(0.5)": results_comparison/winrate_0.5beta_DPO.json

  • "DPO-Soup(0.1)": results_comparison/winrate_0.1beta_DPOsoup.json

  • "DPO-Soup(0.5)": results_comparison/winrate_0.5beta_DPOsoup.json

  • r"MoRL with random $\lambda$": results_comparison/winrate_6scale_2valuesHH_PPO_DirichletRand.json

  • r"MAP with feasible $\lambda$": results_comparison/winrate_6scale_2valuesHH_PPO_MapRand.json

Each file contains a list of entries like this:

    [
      {
        "model-path": "modelsDPO/soup/opt1.3b-2000sample-0.5beta-0.1soup-Anthropic-harmless.json",
        "basemodel-path": "results/opt1.3b-Anthropic-harmless.json",
        "perplexity": "0.70",
        "coherence": "0.48",
        "diversity": "0.48",
        "gpt2-harmless": "0.62",
        "gpt2-helpful": "0.54",
        "humor": "0.21",
        "perplexity_SE": "0.01",
        "coherence_SE": "0.01",
        "diversity_SE": "0.01",
        "gpt2-harmless_SE": "0.01",
        "gpt2-helpful_SE": "0.01",
        "humor_SE": "0.01"
      }
    ]

Calling plot_winrate() produces a figure titled “WinRate” in which each entry becomes a 2D point with “gpt2-helpful” on the x-axis and “gpt2-harmless” on the y-axis; each baseline name gets its own legend entry in the same figure.

Returns:

None

Example

>>> plot_winrate()
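
A rough sketch of the scatter plot and the navigation-efficiency count; the baseline-to-file mapping mirrors the list above, while the marker styling, the 0.5 threshold for the upper-right quadrant, and the output path are assumptions:

    import json
    import matplotlib.pyplot as plt

    def sketch_plot_winrate(baseline_files, output_path="results_comparison/fig_winrate.pdf"):
        # baseline_files maps a legend label to a win rate JSON file, e.g.
        # {"DPO(0.1)": "results_comparison/winrate_0.1beta_DPO.json", ...}.
        fig, ax = plt.subplots()
        for label, path in baseline_files.items():
            with open(path) as f:
                entries = json.load(f)
            x = [float(e["gpt2-helpful"]) for e in entries]
            y = [float(e["gpt2-harmless"]) for e in entries]

            # Navigation efficiency: share of points winning on both axes.
            efficiency = sum(xi > 0.5 and yi > 0.5 for xi, yi in zip(x, y)) / len(x)
            ax.scatter(x, y, label=f"{label} ({efficiency:.0%} upper right)")

        # Reference point for the original model: a 50% win rate on both axes.
        ax.scatter([0.5], [0.5], color="red", label="original model")
        ax.axvline(0.5, linestyle="--", linewidth=0.5)
        ax.axhline(0.5, linestyle="--", linewidth=0.5)
        ax.set_xlabel("gpt2-helpful win rate")
        ax.set_ylabel("gpt2-harmless win rate")
        ax.legend()
        fig.savefig(output_path, bbox_inches="tight")
        plt.close(fig)
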
plot_cal_winrate.plot_cLevels()

Generate a scatter plot to compare average rewards (c-level) across various model baselines.

This function visualizes the average reward levels (c-level) for multiple model baselines, using the “gpt2-helpful” metric as the x-axis and the “gpt2-harmless” metric as the y-axis. Each baseline has its own color and marker style for distinction. A reference model, indicated by a red circle, is included at the original model’s values.

Baselines include DPO with various ratios, DPO-Soup, and MoRL/MAP with random or feasible lambda. Each model’s CSV file contains summary statistics of the metrics; this function extracts the “avg” row and plots its gpt2-helpful and gpt2-harmless values.

Specifically, we make a plot that compares the average reward (c-level) of the following baselines (generated from __main__):

  • "DPO(0.1)":

        harmless_ratios = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
        for ratio in harmless_ratios:
            model_files.append(f'modelsDPO/opt1.3b-2000sample-0.1beta-{ratio}harmless-Anthropic-harmless.csv')

  • "DPO(0.5)":

        for ratio in harmless_ratios:
            model_files.append(f'modelsDPO/opt1.3b-2000sample-0.5beta-{ratio}harmless-Anthropic-harmless.csv')

  • "DPO-Soup(0.1)":

        for ratio in harmless_ratios:
            model_files.append(f'modelsDPO/soup/opt1.3b-2000sample-0.1beta-{ratio}soup-Anthropic-harmless.csv')

  • "DPO-Soup(0.5)":

        for ratio in harmless_ratios:
            model_files.append(f'modelsDPO/soup/opt1.3b-2000sample-0.5beta-{ratio}soup-Anthropic-harmless.csv')

  • r"MoRL with random $\lambda$": all CSV files under modelsPPO/random-lambda/

  • r"MAP with feasible $\lambda$ (Our proposed)": all CSV files under modelsPPO/MAP-lambda

Each CSV file follows this template:

    Statistic,humor,gpt2-helpful,gpt2-harmless,diversity,coherence,perplexity
    avg,1.771,-1.509,0.315,0.871,0.39,-2.785
    avg_std,0.028,0.022,0.024,0.002,0.004,0.01
    50%,2.421,-1.576,0.42,0.906,0.402,-2.745
    60%,2.471,-1.319,0.725,0.918,0.455,-2.654
    70%,2.506,-1.036,1.05,0.928,0.51,-2.568
    80%,2.529,-0.672,1.357,0.937,0.566,-2.452
    90%,2.551,-0.127,1.722,0.945,0.641,-2.286
    99%,2.584,1.12,2.486,0.958,0.792,-1.801

Only the gpt2-helpful (x-axis) and gpt2-harmless (y-axis) columns of the “avg” row are extracted and plotted.

Parameters:

None

Returns:

None. The plot is saved as a PDF file at results_comparison/fig_compare_avg_reward.pdf.

Return type:

None

Example

>>> plot_cLevels()
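
A minimal sketch of the CSV extraction and scatter step, assuming the CSV template shown above; the pandas-based parsing and the baseline-to-file mapping argument are illustrative rather than the module's actual implementation:

    import matplotlib.pyplot as plt
    import pandas as pd

    def sketch_plot_clevels(baseline_csvs, output_path="results_comparison/fig_compare_avg_reward.pdf"):
        # baseline_csvs maps a legend label to a list of CSV paths, e.g.
        # {"DPO(0.1)": ["modelsDPO/opt1.3b-2000sample-0.1beta-0.0harmless-Anthropic-harmless.csv", ...], ...}.
        fig, ax = plt.subplots()
        for label, paths in baseline_csvs.items():
            helpful, harmless = [], []
            for path in paths:
                df = pd.read_csv(path)
                # Keep only the "avg" row; the remaining rows are percentiles.
                avg = df[df["Statistic"] == "avg"].iloc[0]
                helpful.append(float(avg["gpt2-helpful"]))
                harmless.append(float(avg["gpt2-harmless"]))
            ax.scatter(helpful, harmless, label=label)

        ax.set_xlabel("gpt2-helpful (avg reward)")
        ax.set_ylabel("gpt2-harmless (avg reward)")
        ax.legend()
        fig.savefig(output_path, bbox_inches="tight")
        plt.close(fig)
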
plot_cal_winrate.base_model_file = 'results/opt1.3b-Anthropic-harmless.json'