Reproducibility and Seeds

SAES supports deterministic execution through random-seed control, enabling reproducible research.
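The principle is not specific to SAES: seeding a pseudo-random generator makes its output, and therefore every downstream computation, repeatable. A minimal illustration using only Python's standard library (not SAES itself):

```python
import random

# Two generators created with the same seed produce identical sequences.
rng_a = random.Random(42)
rng_b = random.Random(42)

draws_a = [rng_a.random() for _ in range(5)]
draws_b = [rng_b.random() for _ in range(5)]

assert draws_a == draws_b  # same seed, same sequence

# A different seed yields a different, but equally reproducible, sequence.
rng_c = random.Random(123)
assert [rng_c.random() for _ in range(5)] != draws_a
```

SAES applies the same idea: passing the same seed to a function guarantees the same internal random draws, and hence the same results.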

Why Reproducibility Matters

When analyzing stochastic algorithms, reproducibility is crucial for:

  • Research validation: Others can verify your results

  • Debugging: Consistent results make it easier to identify issues

  • Comparisons: Fair comparison requires consistent conditions

  • Publication: Many journals and conferences require reproducible results

Functions with Random Seeds

The following SAES functions support deterministic execution via the seed parameter:

Bayesian Statistical Tests

Both Bayesian tests support the seed parameter for reproducibility:

from SAES.statistical_tests.bayesian import bayesian_sign_test, bayesian_signed_rank_test
import pandas as pd

data = pd.DataFrame({
    'Algorithm_A': [0.9, 0.85, 0.95, 0.9, 0.92],
    'Algorithm_B': [0.5, 0.6, 0.55, 0.58, 0.52]
})

# Deterministic results with seed
result1, _ = bayesian_sign_test(data, sample_size=5000, seed=42)
result2, _ = bayesian_sign_test(data, sample_size=5000, seed=42)
# result1 and result2 will be identical

# Same for signed rank test
result3, _ = bayesian_signed_rank_test(data, sample_size=1000, seed=123)

Histogram Plots

The HistoPlot class accepts a seed parameter so that the jitter applied to identical values is consistent across runs:

from SAES.plots.histoplot import HistoPlot
import pandas as pd

data = pd.read_csv("results.csv")
metrics = pd.read_csv("metrics.csv")

# Create histoplot with reproducible jitter
histoplot = HistoPlot(data, metrics, "Accuracy", seed=42)
histoplot.save_instance("Problem1", "output.png")

Best Practices

  1. Always use seeds for published research: Set explicit seeds for all random operations

  2. Document your seeds: Include seed values in your research papers and code

  3. Use different seeds for different experiments: Avoid accidentally reusing the same random sequence

  4. Version control: Include seed values in your version-controlled analysis scripts
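One way to follow practices 2 and 3 together is to derive each experiment's seed deterministically from a single documented base seed and the experiment name. The helper below is an illustrative sketch, not part of SAES; it uses SHA-256 only because it is a stable hash across Python runs (unlike the built-in hash()):

```python
import hashlib

BASE_SEED = 42  # document this single value alongside your results


def experiment_seed(name: str, base: int = BASE_SEED) -> int:
    """Derive a distinct, reproducible 32-bit seed for each experiment name."""
    digest = hashlib.sha256(f"{base}:{name}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


# Each experiment gets its own seed; rerunning regenerates the same values.
for name in ("sign_test", "signed_rank_test", "histoplot"):
    print(name, experiment_seed(name))
```

With this scheme, only BASE_SEED needs to be recorded in version control; the per-experiment seeds can be regenerated on demand rather than stored.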

Example: Complete Reproducible Workflow

from SAES.statistical_tests.bayesian import bayesian_sign_test, bayesian_signed_rank_test
from SAES.plots.histoplot import HistoPlot
import pandas as pd

# Load data
data = pd.read_csv("algorithm_results.csv")
metrics = pd.read_csv("metrics.csv")

# Reproducible Bayesian analysis
SEED = 42
algorithm_a = data[data['Algorithm'] == 'A']['MetricValue']
algorithm_b = data[data['Algorithm'] == 'B']['MetricValue']

comparison_data = pd.DataFrame({
    'Algorithm_A': algorithm_a.values,
    'Algorithm_B': algorithm_b.values
})

# Run Bayesian test with seed
result, samples = bayesian_sign_test(
    comparison_data,
    sample_size=5000,
    seed=SEED
)

print(f"P(A < B): {result[0]:.4f}")
print(f"P(A ≈ B): {result[1]:.4f}")
print(f"P(A > B): {result[2]:.4f}")

# Create reproducible visualization
histoplot = HistoPlot(data, metrics, "Accuracy", seed=SEED)
histoplot.save_all_instances("comparison.png")

Headless Mode for Automated Workflows

SAES can be run in headless mode (without a display server) for automated pipelines and CI/CD:

# Set matplotlib to use non-interactive backend
export MPLBACKEND=Agg

# Run SAES commands
python -m SAES -ls -ds data.csv -ms metrics.csv -m HV -s friedman -op results.tex
python -m SAES -bp -ds data.csv -ms metrics.csv -m HV -i Problem1 -op boxplot.png
python -m SAES -cdp -ds data.csv -ms metrics.csv -m HV -op cdplot.png

For Python scripts in headless environments:

import matplotlib
matplotlib.use('Agg')  # Must be called before importing pyplot

from SAES.plots.boxplot import Boxplot
import pandas as pd

# Your analysis code here
data = pd.read_csv("results.csv")
metrics = pd.read_csv("metrics.csv")

boxplot = Boxplot(data, metrics, "Accuracy")
boxplot.save_instance("Problem1", "output.png")
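To catch a misconfigured backend early in CI, a quick sanity check can confirm that a non-interactive backend is active and that figures render without a display. This sketch uses plain matplotlib, nothing SAES-specific:

```python
import matplotlib
matplotlib.use('Agg')  # must be set before importing pyplot

import matplotlib.pyplot as plt

# Fail fast if an interactive backend slipped in (e.g. MPLBACKEND overridden).
assert matplotlib.get_backend().lower() == 'agg'

# Figures can still be created and saved with no display attached.
fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
fig.savefig('smoke_test.png')
plt.close(fig)
```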