Segmentation fault when using AMR and passive scalars with user-defined boundary functions #609

Open
sunnywong314 opened this issue Jul 31, 2024 · 12 comments

@sunnywong314

Prerequisite checklist

Place an X in between the brackets on each line as you complete these checks:

  • [x] Did you check that the issue hasn't already been reported?
  • [x] Did you check the documentation in the Wiki for an answer?
  • [x] Are you running the latest version of Athena++?

Summary of issue
I am running my own problem generator (based on code by @c-white) in which supernova ejecta enters the box (via a user-defined boundary condition) and interacts with a companion star. When running with AMR, the code exits with a segmentation fault (with the input provided, at cycle 332, code time 0.38). I am relatively new to Athena++, so any help/pointers are greatly appreciated.

Steps to reproduce

Configure:
python configure.py --prob test_eos_ejecta3 --coord cartesian --flux hllc --nghost 4 --grav mg -mpi -hdf5

Compile and run:
make clean; make
mpirun -n 40 bin/athena -i inputs/model.athinput time/tlim=2.0

Input files (place them in the inputs folder and remove the .txt extension):
donor_1p0_0p21_4p0_0.08.data.txt,
model.athinput.txt

Version info

  • Athena++ version: 24.0
  • Compiler and version: g++ 11.4.0
  • Operating system: Rocky Linux 8.10
  • Hardware and cluster name (if applicable):
  • External library versions (if applicable): openmpi/4.0.7 , hdf5/mpi-1.10.9
@tomidakn
Contributor

We cannot tell what is causing the problem without seeing your code. I suggest you run the code with gdb (or analyze the dumped core file with it) to identify where it died. It is a bit tricky to run gdb with MPI, but you can google it.

@sunnywong314
Author

My apologies, I forgot to attach my problem generator:
test_eos_ejecta3.cpp.txt

I will look into running gdb with MPI. Thank you for the suggestion.

@tomidakn
Contributor

I could not catch anything causing the segmentation fault, but I'm afraid your boundary conditions probably cause another problem. You are directly accessing the passive scalar array in the boundary functions, but with AMR we need to apply the boundary conditions on what we call the coarse buffer used for AMR prolongation.

@c-white @felker do you remember the correct way to apply the boundary conditions on the scalar variables?

@felker
Contributor

felker commented Jul 31, 2024

https://github.com/PrincetonUniversity/athena/wiki/Passive-Scalars#compatibility-with-other-code-features

However, User-Defined Boundary Conditions are currently unsupported for NSCALARS > 0 since there is no AthenaArray<Real> &r parameter in the function signature

This cannot be hacked around in the way shown in the attached pgen file.

@felker changed the title from "Segmentation fault when using AMR" to "Segmentation fault when using AMR and passive scalars with user-defined boundary functions" on Jul 31, 2024
@felker
Contributor

felker commented Jul 31, 2024

To elaborate, the user-defined boundary functions get called during the prolongation step of refinement in ApplyPhysicalBoundariesOnCoarseLevel(), under call stacks at sites like this one:

if (nb.ni.ox1 == 0) {
  if (apply_bndry_fn_[BoundaryFace::inner_x1]) {
    DispatchBoundaryFunctions(pmb, pmr->pcoarsec, time, dt,
                              pmb->cis, pmb->cie, sj, ej, sk, ek, 1,
                              ph->coarse_prim_, pf->coarse_b_,
                              pnrrad->coarse_ir_, pcr->coarse_cr_,
                              BoundaryFace::inner_x1,
                              bvars_subset);
  }

You'll note that ph->coarse_prim_ and other refinement-specific variable buffers are what get passed here, i.e. what the user-defined boundary condition is applied to, not always ph->w. That is why hardcoding lines like the following in your boundary condition functions:

AthenaArray<Real> &prim_scalar = pmb->pscalars->r;

prim_scalar(0,k,j,i) = 0.0;
prim_scalar(1,k,j,i) = 0.0;

won't work. The function would need to be made generic enough to operate on a scalar array passed as a function parameter, e.g. ps->coarse_r_.
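
For illustration only, a hypothetical boundary function that receives the scalar array as a parameter might look like the sketch below. The extra AthenaArray<Real> &r argument does not exist in the current user-BC signature, and the function and parameter names are placeholders; the point is simply that the function writes whatever buffer it is handed (ps->r on the fine level, ps->coarse_r_ during prolongation) instead of reaching for pmb->pscalars->r directly.

// Hypothetical sketch only: the extra "r" parameter is NOT part of the
// current Athena++ user boundary function signature.
void EjectaInnerX1(MeshBlock *pmb, Coordinates *pco,
                   AthenaArray<Real> &prim, FaceField &b,
                   AthenaArray<Real> &r,   // assumed extra argument (fine or coarse scalar buffer)
                   Real time, Real dt,
                   int il, int iu, int jl, int ju, int kl, int ku, int ngh) {
  for (int n = 0; n < NSCALARS; ++n) {
    for (int k = kl; k <= ku; ++k) {
      for (int j = jl; j <= ju; ++j) {
        for (int i = il - ngh; i <= il - 1; ++i) {
          r(n, k, j, i) = 0.0;  // fill ghost zones of whichever buffer was passed in
        }
      }
    }
  }
  // ... fill prim and b ghost zones as before ...
}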

Or you can follow @yanfeij's lead in #492 and have separate user-defined boundary functions for passive scalars, like the ones he made for the radiation intensity:

athena/src/bvals/bvals.cpp, lines 667 to 690 at 185473d:

void BoundaryValues::DispatchBoundaryFunctions(
    MeshBlock *pmb, Coordinates *pco, Real time, Real dt,
    int il, int iu, int jl, int ju, int kl, int ku, int ngh,
    AthenaArray<Real> &prim, FaceField &b, AthenaArray<Real> &ir,
    AthenaArray<Real> &u_cr, BoundaryFace face,
    std::vector<BoundaryVariable *> bvars_subset) {
  if (block_bcs[face] == BoundaryFlag::user) { // user-enrolled BCs
    pmy_mesh_->BoundaryFunction_[face](pmb, pco, prim, b, time, dt,
                                       il, iu, jl, ju, kl, ku, NGHOST);
    // user-defined boundary for radiation
    if ((NR_RADIATION_ENABLED || IM_RADIATION_ENABLED)) {
      pmy_mesh_->RadBoundaryFunc_[face](pmb, pco, pmb->pnrrad, prim, b, ir, time, dt,
                                        il, iu, jl, ju, kl, ku, NGHOST);
    }
    if (CR_ENABLED) {
      pmy_mesh_->CRBoundaryFunc_[face](pmb, pco, pmb->pcr, prim, b, u_cr, time, dt,
                                       il, iu, jl, ju, kl, ku, NGHOST);
    }
  }
  // KGF: this is only to silence the compiler -Wswitch warnings about not handling the
  // "undef" case when considering all possible BoundaryFace enumerator values. If "undef"
  // is actually passed to this function, it will likely die before that ATHENA_ERROR()

Since your user-defined boundary functions are mostly outflow, I would try hardcoding a call to the built-in outflow functions for only the passive scalars in BoundaryValues::DispatchBoundaryFunctions, and keep calling your user-defined function for the other variables.
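
As a rough illustration of that idea (a sketch only, not the actual built-in routines, whose exact entry points are not reproduced here), an outflow-style fill of the scalar ghost zones at the inner x1 face could look like the following; the helper name OutflowScalarsInnerX1 is made up, and r stands in for whichever scalar buffer is in play at the call site (the fine-level array or the coarse buffer used during prolongation):

// Sketch of an outflow-style fill for passive scalars at the inner x1 face:
// copy the first active cell into every ghost cell. The function name is
// hypothetical and "r" is whatever scalar buffer applies at the call site.
void OutflowScalarsInnerX1(AthenaArray<Real> &r,
                           int il, int jl, int ju, int kl, int ku, int ngh) {
  for (int n = 0; n < NSCALARS; ++n) {
    for (int k = kl; k <= ku; ++k) {
      for (int j = jl; j <= ju; ++j) {
        for (int i = 1; i <= ngh; ++i) {
          r(n, k, j, il - i) = r(n, k, j, il);
        }
      }
    }
  }
}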

@c-white
Contributor

c-white commented Jul 31, 2024

Thanks @tomidakn and @felker -- this was sort of working with an earlier version of the codebase, but perhaps that was just luck. At least there are a couple of ways forward for fixing the passive scalar boundaries. @sunnywong314 I can help pursue one of them. For this project, I'm inclined to do some quick and dirty pointer comparisons, so that only the pgen file needs to be modified, but I'll spend a little time having a closer look at the code.
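
For concreteness, such a pointer comparison inside the pgen's boundary function might look roughly like the sketch below; whether the coarse buffers (here assumed to be named coarse_prim_ and coarse_r_) are reachable from a pgen is an assumption, not something verified in this thread.

// Hypothetical pointer-comparison hack inside a user boundary function:
// pick the scalar buffer that matches whichever hydro buffer was passed in.
// The member names coarse_r_ / coarse_prim_ and their accessibility from a
// pgen are assumptions made for illustration.
AthenaArray<Real> &r = (&prim == &(pmb->phydro->w)) ?
                       pmb->pscalars->r :          // normal (fine-level) call
                       pmb->pscalars->coarse_r_;   // AMR prolongation call on the coarse buffer
// ... then fill r(n,k,j,i) in the ghost zones alongside prim ...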

@sunnywong314
Author

@tomidakn @felker @c-white Many thanks for looking into this!

Passive scalars didn't cause the segmentation fault, but it is good to know that the hack in the boundary function doesn't work.

I removed all passive-scalar-related lines from the problem generator for clarity:
test_eos_ejecta3.cpp.txt
and scaled down the problem so that it runs faster:
model.athinput.txt

I get a segmentation fault if I configure with:
python configure.py --prob test_eos_ejecta3 --coord cartesian --flux hllc --nghost 4 --grav mg -mpi
make clean; make
and run with:
mpirun -n 20 bin/athena -i inputs/model.athinput time/tlim=2

However, if I configure without MPI:
python configure.py --prob test_eos_ejecta3 --coord cartesian --flux hllc --nghost 4 --grav mg
and run with bin/athena -i inputs/model.athinput
then the segmentation fault goes away.

The segmentation fault also goes away if I configure with the -debug option with MPI still on:
python configure.py --prob test_eos_ejecta3 --coord cartesian --flux hllc --nghost 4 --grav mg -mpi -debug
and run with mpirun -n 20 bin/athena -i inputs/model.athinput time/tlim=2

I tracked down 9d763ac as the first commit that gives me the segmentation fault. All earlier commits that I tested, going back to 2bd7c69 from Mar 2021, were fine.

I haven't learned how to run a debugger with MPI, so I don't know which line of the code gave me the segmentation fault.

Here are the modules I have:

  1) modules/2.2-20230808 (S)   2) slurm (S)   3) gcc/11.4.0   4) openmpi/4.0.7   5) hdf5/mpi-1.10.9

The modules are the same at compile time and at run time.

mpicxx --version:
g++ (Spack GCC) 11.4.0

@tomidakn
Contributor

tomidakn commented Aug 5, 2024

OK, it sounds like my fault. I'll take a look.

Can you try it with nghost=2?

@sunnywong314
Author

nghost = 2 still gives the segmentation fault (note: previous runs used xorder = 3; for this run I changed to xorder = 2).

@tomidakn
Contributor

tomidakn commented Aug 9, 2024

I could reproduce your issue with g++ (8.5.0) + Intel MPI, but not with icpc (2023) + Intel MPI. So this issue seems to be g++-specific.

@c-white
Contributor

c-white commented Aug 9, 2024

@sunnywong314 To try this on Popeye:

module load modules/2.3-20240529 intel-oneapi-compilers/2024.1.0 intel-oneapi-mpi/intel-2021.12.0 hdf5/intel-mpi-1.14.3
python configure.py --prob test_eos_ejecta3 --coord cartesian --flux hllc --nghost 4 --grav mg --cxx icpc -mpi -hdf5 --mpiccmd mpiicpx

In your submission script, try either srun or mpirun. Hopefully this runs smoothly, and it might even run faster.

@tomidakn
Contributor

I tested the latest code with g++ and Intel MPI but with another pgen, and it ran smoothly. So I'm afraid this issue is very subtle and somehow specific to your pgen.
