Parallelization
Embarrassingly Parallel Problem
Sorcha’s design lends itself perfectly to parallelization – when it simulates a large number of Solar System objects, each one is considered in turn independently of all other objects. If you have access to a large number of computing cores, you can run Sorcha much more quickly by dividing up the labor: giving a small part of your model population to each core.
This involves two subtasks: breaking up your model population into an appropriate number of input files with unique names and organizing a large number of cores to simultaneously run Sorcha on their own individually-named input files. Both of these tasks are easy in theory, but tricky enough in practice that we provide some guidance below.
Slurm
Slurm Workload Manager is a resource management utility commonly used by computing clusters. We provide starter code for running large parallel batches using Slurm, though general guidance we provide is applicable to any system. The documentation for Slurm is available here. Please note that your HPC (High Performance Computing) facility’s Slurm setup may differ from those on which Sorcha was tested, and it is always a good idea to read any facility-specific documentation or speak to the HPC maintainers before you begin to run jobs.
Quickstart
We provide as a starting point our example scripts for running on HPC facilities using Slurm. Some modifications will be required to make them work for your facility.
Below is a very simple Slurm script example designed to run the demo files three times on three cores in parallel. Here, one core has been assigned to each Sorcha run, with each core assigned 1Gb of memory.
#!/bin/bash
#SBATCH --job-name=three_jobs_in_one
#SBATCH --account=lazy_slurm_user
#SBATCH --partition=the_bestest_partition
#SBATCH --ntasks=3
#SBATCH --mem-per-cpu=1G
#SBATCH --time=24:00:00
#SBATCH --output=log-%a.log
dt=$(date '+%d/%m/%Y %H:%M:%S');
echo "$dt Beginning Sorcha."
srun --exclusive -N1 -n1 sorcha run -c ./sorcha_config_demo.ini --pd ./baseline_v2.0_1yr.db --ob ./sspp_testset_orbits.des -p ./sspp_testset_colours.txt -o ./ -t testrun_e2e_1 -st testrun_e2e_stats_1 &
srun --exclusive -N1 -n1 sorcha run -c ./sorcha_config_demo.ini --pd ./baseline_v2.0_1yr.db --ob ./sspp_testset_orbits.des -p ./sspp_testset_colours.txt -o ./ -t testrun_e2e_2 -st testrun_e2e_stats_2 &
srun --exclusive -N1 -n1 sorcha run -c ./sorcha_config_demo.ini --pd ./baseline_v2.0_1yr.db --ob ./sspp_testset_orbits.des -p ./sspp_testset_colours.txt -o ./ -t testrun_e2e_3 -st testrun_e2e_stats_3 &
wait
dt=$(date '+%d/%m/%Y %H:%M:%S');
echo "$dt Sorcha complete."
Please note that time taken to run and memory required will vary enormously based on the size of your input files, your input population, and the chunk size assigned in the Sorcha configuration file - we therefore recommend test runs before you commit to very large runs. The chunk size is an especially important parameter; too small and Sorcha will take a very long time to run, too large and the memory footprint may become prohibitive. We have found that chunk sizes of 1,000 to 10,000 work best.
Below is a more complex example of a Slurm script. Here, multi_sorcha.sh calls multi_sorcha.py, which splits up an input file into a number of ‘chunks’ and runs Sorcha in parallel on a user-specified number of cores.
multi_sorcha.sh:
#!/bin/bash
#SBATCH --job-name=the_best_job
#SBATCH --account=im_a_power_user
#SBATCH --partition=the_best_partition
#SBATCH --mem=all_of_it
#SBATCH --time=24:00:00
#SBATCH --output=log-%a.log
python3 multi_sorcha.py --config my_config.ini --input_orbits my_orbits.csv --input_physical my_colors.csv --pointings my_pointings.db --path ./ --chunksize $1 --cores $2 --instance ${SLURM_ARRAY_TASK_ID} --cleanup
multi_sorcha.py:
import os
import astropy.table as tb
from multiprocessing import Pool
import pandas as pd
import sqlite3
import numpy as np
def run_sorcha(i, args, pointings, instance, config, stats):
print(f"sorcha run -c {config} --pd {pointings} -o {args.path}{instance}/ -t {instance}_{i} --ob {args.path}{instance}/orbits_{i}.csv -p {args.path}{instance}/physical_{i}.csv --st {stats}_{i}", flush=True)
os.system(f"sorcha run -c {config} --pd {pointings} -o {args.path}{instance}/ -t {instance}_{i} --ob {args.path}{instance}/orbits_{i}.csv -p {args.path}{instance}/physical_{i}.csv --st {stats}_{i}")
if __name__ == '__main__':
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--input_orbits', type=str)
parser.add_argument('--input_physical', type=str)
parser.add_argument('--path', type=str)
parser.add_argument('--chunksize', type=int)
parser.add_argument('--cores', type=int)
parser.add_argument('--instance', type=int)
parser.add_argument('--cleanup', action='store_true')
parser.add_argument('--pointings', type=str)
parser.add_argument('--stats', type=str)
parser.add_argument('--config', type=str)
args = parser.parse_args()
chunk = args.chunksize
instance = args.instance
pointings = args.pointings
path = args.path
config = args.config
stats=args.stats
orbits = tb.Table.read(args.input_orbits)
orbits = orbits[instance*chunk:(instance+1)*chunk]
orb_splits = np.array_split(range(len(orbits)), args.cores)
physical = tb.Table.read(args.input_physical)
physical = physical[instance*chunk:(instance+1)*chunk]
phys_splits = np.array_split(range(len(physical)), args.cores)
os.system(f'mkdir {instance}')
os.system(f'cp {pointings} {instance}/')
for i in range(args.cores):
sub_orb = orbits[orb_splits[i]]
sub_phys = physical[phys_splits[i]]
sub_orb.write(f"{args.path}{instance}/orbits_{i}.csv", overwrite=True)
sub_phys.write(f"{args.path}{instance}/physical_{i}.csv", overwrite=True)
with Pool(processes=args.cores) as pool:
pool.starmap(run_sorcha, [(i, args, pointings, instance, config, stats) for i in range(args.cores)])
data = []
for i in range(args.cores):
data.append(pd.read_sql("select * from sorcha_results", sqlite3.connect(f"{args.path}{instance}/{instance}_{i}.db")))
data = pd.concat(data)
data.to_sql("sorcha_results", sqlite3.connect(f"{args.path}output_{instance}.db"))
if args.cleanup:
os.system(f"rm {args.path}{instance}/*")
os.system(f"rmdir {args.path}{instance}")
Note
We provide these here for you to copy, paste, and edit as needed. You might have to some slight modifications to both the Slurm script and multi_sorcha.py, for example if you're using Sorcha without calling the stats file.
multi_sorcha.sh requests many parallel Slurm jobs of multi_sorcha.py, feeding each a different --instance parameter. After changing ‘my_orbits.csv’, ‘my_colors.csv’, ‘my_pointings.db’, ‘my_config.ini’, and the various Slurm parameters to match the above, for a file of 1000 objects you could generate 10 jobs with 4 cores running 25 orbits each, as follows:
sbatch --array=0-9 multi_sorcha.sh 100 4
You can run multi_sorcha.py on the command line as well:
python multi_sorcha.py --config sorcha_config_demo.ini --input_orbits mba_sample_1000_orbit.csv --input_physical mba_sample_1000_physical.csv --pointings baseline_v2.0_1yr.db --path ./ --chunksize 1000 --cores 4 --instance 0 --stats mbastats --cleanup
This will generate a single output file. It should work fine on a laptop, and be a bit (but not quite 4x) faster than the single-core equivalent due to overheads.
Note
This ratio improves as input file sizes grow. Make sure to experiment with different numbers of cores to find what’s fastest given your setup and file sizes.
Sorcha’s Helpful Utilities
Sorcha comes with a tool designed to combine the results of multiple runs and the input files used into tables on a SQL database. This can make exploring your results easier. To see how to use this tool, on the command line, run:
sorcha outputs create-sqlite –-help
Sorcha also has a tool designed to search for and check the logs of a large number of runs. This tool can make sure all runs completed successfully, and output to either the terminal or a .csv file the names of the runs which have not completed and the relevant error message, if applicable. To see the usage of this tool, on the command line run:
sorcha outputs check-logs –-help
Best Practices/Tips and Tricks
We strongly recommend that HPC users download the auxiliary files needed to run the ASSIST+REBOUND into a known, named directory, and use the --ar command line flag in their
sorcha runcall to pointSorchato those files. You can download the auxiliary files using:sorcha bootstrap --cache <directory>
And then run
Sorchavia:sorcha run … --ar /path/to/folder/
This is because Sorcha will otherwise attempt to download the files into the local cache, which may be on the HPC nodes rather than in your user directory, potentially triggering multiple slow downloads.
We recommend that each
Sorcharun be given its own individual output directory. If multiple parallelSorcharuns are attempting to save to the same file in the same directory, this will cause confusing and unexpected results.Sorchaoutput files can be very large, and user directories on HPC facilities are usually space-limited. Please ensure that yourSorcharuns are directing the output to be saved in a location with sufficient space, like your HPC cluster’s scratch drive.Think about having useful, helpful file names for your outputs. It is often tempting to call them something like “sorcha_output_<number>” or “sorcha_output_<taskid>”, but hard-won experience has led us to instead recommend more explanatory names for when you come back to your output later.
Tip
You can use the sorcha init command to copy Sorcha's example configuration files into a directory of your choice.