Week 1-2: First Attempt at trajectory streaming using bash scripting

24 Jun 2025    

In the first two weeks of my GSoC project, I focused on setting up a basic solution for streaming WESTPA trajectories. In WESTPA, simulations are run using a bash script - runseg.sh. A separate script is launched by the WESTPA work manager for each weighted ensemble segment. The basic idea was to modify this script to both run the simulation and a python script that would receive the trajectory data and process it in real-time. The simulation is run in the background and instructed to use the interactive molecular dynamics (IMD) protocol to stream the trajectory data over an assigned port. A basic outline of the script using GROMACS for an NaCl system is as follows:

PORT=$((WEST_CURRENT_SEG_ID +10000))

gmx mdrun -deffnm seg -imdwait -imdport $PORT &

#Calculate pcoord with MDAnalysis
python $WEST_SIM_ROOT/common_files/get_distance_streaming.py $PORT
cat dist.dat > $WEST_PCOORD_RETURN

The -imdwait flag tells GROMACS to wait for the IMD client to connect before starting the simulation, and the -imdport flag specifies the port on which the IMD client will connect. The gromacs simulation is run in the background using the & operator, allowing the analysis script to execute while the simulation is running. In this case, the assigned port is simply calculated by adding 10000 to the current segment ID, which is unique for each segment in WESTPA. Whilst this ensures that each segment has a unique port, it is not a robust solution as it assumes that these ports are not already in use by another process.

An outline of the get_distance_streaming.py script is as follows:

import MDAnalysis as mda
from MDAnalysis.analysis.distances import self_distance_array
import numpy as np
from imdclient.IMD import IMDReader
import sys

port = int(sys.argv[1])
u = mda.Universe('bstate.gro', f"imd://localhost:{port}")
ag = u.select_atoms('not water')

dist = []
for ts in u.trajectory:
    dist.append(self_distance_array(ag)[0])
np.savetxt("dist.dat", np.array(dist))

As a client, IMDClient expects that the port is already open and listening for connections. In this current form, the script will likely fail as the port is not open when the script is executed. Therefore, a small modification to the get_distance_streaming.py script is needed to ensure that the port is open before the IMD client attempts to connect. This is shown below:

import socket
import time
import sys

port_open = False
timeout=0.2
port=int(sys.argv[1])
host="localhost"
sock = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
while not port_open:
    try:
        sock.connect((host, port))
    except ConnectionRefusedError:
        time.sleep(timeout)
    else:
        print(f"Port {port} on {host} is now open!")
        port_open = True

This simple code snippet attempts to connect to the specified port every 0.2 seconds until it succeeds, indicating that the port is open and ready for connections. Once the connection is established, the script proceeds to run the IMD client and process the trajectory data.

Issues with this approach and next steps

As mentioned above, this current setup does not smartly assign ports to segments, which could lead to the weighted ensemble simulation crashing if other processes are using the same ports. To address this, a more robust port assignment strategy is needed. The initial idea was to use WESTPA to assign ports to segments. After some investigation, this is not as straightforward as it seems, as (as far as I am aware) python does not have a robust way to get all the open ports on a system. In addition, there is no way to guarantee that the port will remain open in the time between WESTPA assigning the port and the molecular dynamics simulation connecting to it.

After discussion with my mentors, Oliver Beckstein, Jeremy Leung and Lawson Woods, we decided that the best approach would be one where the molecular dynamics simulation first connects to the port, and then communicates the port back to WESTPA/the analysis script. This would ensure that the analysis script and the molecular dynamics simulation are always using the same port, and that the port is open. In its current form, it is not possible to communicate information back to WESTPA when the segments are running, so we cannot use WESTPA to communicate the port to the analysis script. Therefore, it was suggested to use a python script to run both the molecular dynamics simulation and the analysis. This script would first start the molecular dynamics simulation, check which port the molecular dynamics simulation is using, and finally run the analysis script. Wrappers for GROMACS (such as GromacsWrapper) can be used to launch the GROMACS simulation and retrieve the port number from the output.

I have also attempted to integrate this workflow further into WESTPA, by modifying executable.py in WESTPA propagators. The aim would be to use the WESTPA work manager to launch analysis processes alongside the simulation processes. I will discuss this in more detail in my next post.