Simulated data analyses¶
Adapting methodology from Burt et al., 2020, NeuroImage, we ran two comprehensive simulation analyses to assess how accurate the different null frameworks are when there is a known “ground truth.” Here, we briefly describe the generation of the simulated brain maps and associated analyses used to generate results shown in the manuscript.
Note that all simulation analyses can be run with the make simulations command from the root of the repository.
Visualization of the simulation results can be run with the make plot_simulations command, and supplementary analyses can be run with the make suppl_simulations command.
Warning
Although we provide the make simulations command to be consistent with the other data processing and analysis steps, we strongly encourage you not to run the simulation analyses this way.
The simulation analyses are extremely computation- and time-intensive: we estimate that using make simulations would require at least one month of computation time.
Below we provide batch scripts that can be used to run the simulations on an HPC cluster to speed things up significantly.
Creating the simulated brain maps¶
The first step in this whole process is to create simulated brain maps. Critically, we need to (1) be able to control the degree of spatial autocorrelation in the maps, (2) project the maps to the cortical surface (for use with spatial permutation null methods), and (3) specify the correlation between pairs of maps.
To solve the first problem—controlling spatial autocorrelation—we use code generously provided by Burt and colleagues (2020, NeuroImage).
To solve the second problem—projecting the maps to the cortical surface—we adapt this code, which was used to generate 2-dimensional Gaussian random fields, to make 3D fields instead.
We then pretend these 3D fields are simple volumetric brain images and use FreeSurfer's mri_vol2surf to get cortical representations of the maps.
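For intuition, here is a minimal, self-contained sketch of one common way to generate such a field: Gaussian white noise is filtered in Fourier space so that its spatial power spectrum falls off as |k| to the power of -alpha, with alpha = 0 giving spatially-uncorrelated noise and larger alpha giving smoother, more autocorrelated fields. This is only a toy stand-in for illustration, not the code we adapted from Burt and colleagues.

import numpy as np

def simulate_grf_3d(shape=(64, 64, 64), alpha=2.0, seed=None):
    """Generate a 3D Gaussian random field with power spectrum ~ |k| ** -alpha."""
    rng = np.random.default_rng(seed)
    # start from spatially-uncorrelated Gaussian white noise
    noise = rng.standard_normal(shape)
    # build the radial spatial-frequency grid
    freqs = np.meshgrid(*(np.fft.fftfreq(n) for n in shape), indexing='ij')
    knorm = np.sqrt(sum(f ** 2 for f in freqs))
    knorm[0, 0, 0] = np.inf  # avoid division by zero at the DC component
    # impose the desired power spectrum by filtering the noise in Fourier space
    field = np.fft.ifftn(np.fft.fftn(noise) * knorm ** (-alpha / 2)).real
    # z-score so fields with different alpha are on a comparable scale
    return (field - field.mean()) / field.std()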
The final problem—specifying the correlation between pairs of maps—is a bit trickier. We could try to brute-force it, generating random maps over and over until we get two with the desired correlation; while this works quite readily when the amount of spatial autocorrelation in the maps is high, at low levels of spatial autocorrelation (or no spatial autocorrelation!) achieving our target correlation of r = 0.15 would be nigh impossible. Instead, we use a multivariate normal distribution to create two random vectors that are correlated at our target r, and then use the above procedure to convert them into spatially-autocorrelated brain maps on the surface. If the generated brain maps remain correlated at our target r we use them in the rest of the analysis; if they don't, we discard them and try again with a different set of correlated random vectors.
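To illustrate the pair-generation logic, a rough sketch might look like the following. Note that this is not the repository's implementation: the helper names are made up, the tolerance around the target correlation is arbitrary, and to_brain_map merely stands in for the GRF-plus-mri_vol2surf procedure described above.

import numpy as np

def correlated_vectors(n, r, rng):
    # draw two standard-normal vectors of length n with expected correlation r
    cov = [[1.0, r], [r, 1.0]]
    return rng.multivariate_normal([0.0, 0.0], cov, size=n).T

def to_brain_map(vec):
    # stand-in for imposing spatial autocorrelation and projecting to the
    # surface; in the real pipeline this step can distort the correlation
    return vec

rng = np.random.default_rng(1234)
target, tol = 0.15, 0.005
while True:
    x, y = correlated_vectors(n=10000, r=target, rng=rng)
    xmap, ymap = to_brain_map(x), to_brain_map(y)
    # keep the pair only if the *maps* are still correlated at (roughly) target r
    if abs(np.corrcoef(xmap, ymap)[0, 1] - target) <= tol:
        break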
Code for generating pairs of simulated brain maps can be found in scripts/simulations/generate_simulations.py.
Testing the null models¶
You might notice that there are two scripts in the scripts/simulations directory that seem wildly similar (scripts/simulations/run_simulated_nulls_serial.py and scripts/simulations/run_simulated_nulls_parallel.py), and you'd be right: the scripts are nearly identical.
Their primary difference is the level at which computation of the null maps is parallelized.
The run_simulated_nulls_parallel.py script applies the null models to the simulated data, running different simulations in parallel (that is, there is no parallelization at the level of null map generation).
This is most useful for null frameworks where running the simulation is the most costly computational step (e.g., for the Vázquez-Rodríguez, Baum, Cornblath, Váša, Hungarian, Moran, and naive non-parametric methods).
On the other hand, the run_simulated_nulls_serial.py script applies the null models to the simulated data with the generation of null maps done in parallel (that is, the simulations themselves are run serially).
This is most useful for null frameworks where generating the null maps for a given method is the most costly computational step (i.e., for the Burt-2018 and Burt-2020 methods).
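To make the distinction concrete, a stripped-down sketch of the two strategies might look like the following; apply_null and the list of simulations are placeholders, not the repository's actual functions.

from concurrent.futures import ProcessPoolExecutor

def apply_null(sim, n_perm=10000, n_proc=1):
    # stand-in for applying one null framework to one simulated map pair;
    # the real function would return correlations / p-values
    return sim

simulations = list(range(1000))  # placeholder for the simulated map pairs

if __name__ == '__main__':
    # run_simulated_nulls_parallel.py-style: simulations run in parallel,
    # null-map generation within each simulation runs serially
    with ProcessPoolExecutor(max_workers=12) as pool:
        results = list(pool.map(apply_null, simulations))

    # run_simulated_nulls_serial.py-style: simulations run one at a time, but
    # each call is free to parallelize its own (expensive) null-map generation
    results = [apply_null(sim, n_proc=12) for sim in simulations]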
Running null models “in serial”¶
Below we provide the outline of a batch script for submission to a SLURM job scheduler on an HPC for use with null models that are best run “in serial” (i.e., the Burt-2018 and Burt-2020 models).
Because these models take so long to run, we recommend using array jobs to split them up into smaller chunks that can be run in parallel (and also running the different spatial autocorrelation levels in parallel, too).
Note that in the below script you will have to specify the $ALPHA and $SPATNULL variables, and submit a separate job for each combination of these variables.
(You could change these variables to parameters and use a separate script to iterate through and submit the different combinations.)
The time, CPU, and memory requirements can be modified as you see fit, but we find the following combinations work moderately well (and should avoid memory errors):
Burt-2018: 3 days, 24 CPUs, 48G
Burt-2020: 3 days, 24 CPUs, 96G
#!/bin/bash
#SBATCH --time=3-00:00
#SBATCH --cpus-per-task=24
#SBATCH --mem=96G
#SBATCH --array=0-950:50
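# --array=0-950:50 creates array tasks with IDs 0, 50, 100, ..., 950; each task
# passes its task ID and a chunk size of 50 to the script below, splitting the
# 1,000 simulations into chunks that run in parallel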
# which null model / spatial autocorrelation combo are we running?
ALPHA=alpha-0.0
SPATNULL=burt2020
# run the script
SINGULARITYENV_OPENBLAS_NUM_THREADS=1 SINGULARITYENV_MKL_NUM_THREADS=1 \
singularity run --cleanenv -B ${PWD}:/opt/spatnulls --pwd /opt/spatnulls \
${PWD}/container/markello_spatialnulls.simg \
python scripts/simulations/run_simulated_nulls_serial.py \
--n_perm 10000 --n_proc ${SLURM_CPUS_PER_TASK} --seed 1234 \
--alpha ${ALPHA} --spatnull ${SPATNULL} \
-- ${SLURM_ARRAY_TASK_ID} 50
You'd then need to iterate over all the spatial autocorrelation parameters (alpha-0.0 through alpha-3.0) and the different null models you wanted to run this way (burt2018, burt2020) and re-submit jobs for each (and then watch as your HPC allocation priority vanishes into thin air as your jobs queue up and process!).
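If you would rather not edit and re-submit the batch script by hand, a small helper along the following lines could automate it. This is a hypothetical sketch, not part of the repository: it assumes the batch script above has been saved as run_serial_nulls.sbatch and modified (as suggested earlier) to read ALPHA and SPATNULL from the environment instead of hard-coding them, and the alpha values listed are purely illustrative.

import itertools
import subprocess

# illustrative values; use whatever alpha levels your simulations were generated with
alphas = ['alpha-0.0', 'alpha-1.0', 'alpha-2.0', 'alpha-3.0']
spatnulls = ['burt2018', 'burt2020']

for alpha, spatnull in itertools.product(alphas, spatnulls):
    # export the combination as environment variables for the batch script
    subprocess.run(
        ['sbatch', f'--export=ALL,ALPHA={alpha},SPATNULL={spatnull}',
         'run_serial_nulls.sbatch'],
        check=True,
    )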
Running null models “in parallel”¶
Below we provide the outline of a batch script for submission to a SLURM job scheduler on an HPC for use with null models that are best run “in parallel” (i.e., the Vázquez-Rodríguez, Baum, Cornblath, Váša, Hungarian, Moran, and naive non-parametric models).
Because these methods don’t take nearly so long to run as those referenced above, we do not recommend using an array job for these.
Note, however, that you will still have to change the $ALPHA and $SPATNULL variables and submit a separate job for each combination of these variables (or you can increase the time limit and run multiple at once by providing multiple values to the --spatnull and --alpha parameters).
The time, CPU, and memory requirements can be modified as you see fit, but we find the following combinations work moderately well (and should avoid memory errors):
Vázquez-Rodríguez, Baum, Váša, Hungarian, naive non-parametric: 3 hours, 12 CPUs, 24G
Cornblath: 12 hours, 24 CPUs, 24G
Moran: 24 hours, 12 CPUs, 24G
#!/bin/bash
#SBATCH --time=3:00:00
#SBATCH --cpus-per-task=12
#SBATCH --mem=24G
# which null model / spatial autocorrelation combo are we running?
ALPHA=alpha-0.0
SPATNULL=burt2020
# run the script
SINGULARITYENV_OPENBLAS_NUM_THREADS=1 SINGULARITYENV_MKL_NUM_THREADS=1 \
singularity run --cleanenv -B ${PWD}:/opt/spatnulls --pwd /opt/spatnulls \
${PWD}/container/markello_spatialnulls.simg \
python scripts/simulations/run_simulated_nulls_parallel.py \
--n_perm 10000 --n_proc ${SLURM_CPUS_PER_TASK} --seed 1234 \
--alpha ${ALPHA} --spatnull ${SPATNULL} -- 1000
Visualizing simulation results¶
Once the simulation analyses have been run, the results can be plotted using the command make plot_simulations.
(Note that this assumes you have already run make simulations!)
This will save out a lot of different individual plots, which we combined into the results shown in Figures 2, 3 and S1-5 in the manuscript.
Supplementary analyses¶
We tested a few extra things in our analyses to see how minor variations in the implementation of the different null models impacted the results. Namely, we used the simulated data to assess the relative runtimes of the different null models, and to examine how the size of the null distribution generated by each model impacts the accuracy of the statistical estimates it produces.
You can generate these supplementary results with the command make suppl_simulations.
(Note that this assumes you have already run make simulations!)