Issues with Pulsed Jet runs on Different Machines

2.5D Runs

We are now doing these runs at a higher resolution, approximately 3.5x the resolution used by Raga. I have tried running these on alfalfa (16 procs) and bluestreak (256 procs). Below is a summary of what is going on…

Beta | Alfalfa | BlueStreak
Inf (hydro) | completed in ~11 hrs | restarted from frame 34, then from 35, then from 88, and completed (can't tell exactly what the total runtime was, but it looks to be ~7 hrs)
5 | quits at frame 6.6 with an error message (see attachment) | completed in ~11.6 hrs
1 | ? | restarted from frame 79; now cannot get past frame 82 with restarts
0.4 | ? | restarted from frame 34; now cannot get past frame 61 with restarts

All the runs that I have had to restart on bluestreak quit with the same error. See the attached files/images for the standard output from the beta = 0.4 restart, and also the corresponding error messages from core 64.

3D Runs

I have been trying to do these runs on kraken on 480 procs…

Beta | Kraken
Inf (hydro) | completed in ~1.6 hrs
5 | restarting from frame 89, but last estimated wall time before restarting was > 2 days
1 | restarting from frame 51, but last estimated wall time before restarting was > 2 weeks
0.4 | restarting from frame 40, but last estimated wall time before restarting was > 2 days

There are no error messages from these runs; they just slow down dramatically because they keep requesting restarts. I attached the standard output from the restarted beta = 5 run.

UPDATE

Turning off refinement for B-field gradients helped the run on alfalfa get past frame 6.6, so I'm trying this on the other machines as well for the runs that have not yet completed.

Attachments (4)


Comments

1. Jonathan -- 11 years ago

So the error on alfalfa is because the code is running with -check uninit, which catches the use of uninitialized values, and it is correctly identifying a bug in the code. Line 3063 should come after line 3064 in sweep_scheme.f90. This bug affects any 2.5D run and could produce erroneous results.
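For anyone unfamiliar with -check uninit, below is a minimal hypothetical sketch of the class of bug it traps (illustrative only, not the actual sweep_scheme.f90 lines): a value is read on one line before the line that assigns it, so the two lines need to be swapped.

    program uninit_demo
      implicit none
      real :: qleft, flux
      ! Hypothetical ordering bug: flux is computed from qleft before
      ! qleft is ever assigned. Built with ifort -O0 -check uninit,
      ! the run aborts here and reports the uninitialized variable.
      flux = 0.5 * qleft   ! uses qleft before it is set
      qleft = 1.0          ! the assignment that should come first
      print *, flux
    end program uninit_demo

Swapping the two marked lines, the analogue of moving line 3063 after line 3064, clears the error.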

2. Jonathan -- 11 years ago

The bluestreak beta = 0.4 run looks as though processor rank 64 had a segfault (signal 11). This should not be related to the bug seen on alfalfa, as that bug should not lead to segfaults. We would need the traceback to try to narrow down the source of the segfault. It might be possible to find it using the core.64 file, though you will have to use the addr2line command-line utility.

3. Jonathan -- 11 years ago

So the kraken run looks like the time steps are steadily decreasing… The 'max speed' reported on the standard out is unfortunately only the local max speed of the grids owned by processor 0. It would be easy to have this be the max speed across the entire grid, but that involves an otherwise unnecessary mpi_reduce call, which requires the processors to synchronize. If you are not using threading I imagine it wouldn't make much of a difference (< .01%), and it would be nice to have the max speed on the entire grid reported. The code does synchronize the max speed between root steps (so it can determine the next time step size), but it does not report this value directly. In any case, I suspect there is a region where the maximum wave speed is increasing (probably a low-density, high-B-field region). There was a mechanism in AstroBEAR 1.0 that artificially increased the density to lower the Alfvén speed, but that has not yet been ported over.
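To make the point about the extra synchronization concrete, here is a minimal sketch (hypothetical names, not AstroBEAR's actual routines) of reporting the grid-wide max speed: each processor contributes its local maximum, and a single MPI_Allreduce with MPI_MAX returns the global value to every rank.

    program global_max_speed_demo
      use mpi
      implicit none
      integer :: ierr, rank
      real(kind=8) :: local_max_speed, global_max_speed
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      ! Stand-in for the fastest wave speed on this processor's own
      ! grids (what the current standard out prints for rank 0 only).
      local_max_speed = real(rank + 1, kind=8)
      ! One collective reduction (the synchronization point mentioned
      ! above) gives every rank the maximum over the whole grid.
      call MPI_Allreduce(local_max_speed, global_max_speed, 1, &
                         MPI_DOUBLE_PRECISION, MPI_MAX, MPI_COMM_WORLD, ierr)
      if (rank == 0) print *, 'global max speed =', global_max_speed
      call MPI_Finalize(ierr)
    end program global_max_speed_demo

In a non-threaded run the cost of this one reduction per reported step should be negligible, which is the basis for the "< .01%" estimate above.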