Demo Video

This video demonstrates the Barnes-Hut N-body Simulation with MPI.

Project Overview: Barnes-Hut N-body Simulation with MPI

This project implements the Barnes-Hut algorithm for simulating gravitational interactions between N bodies, using MPI (Message Passing Interface) for parallelization. Instead of computing all O(N²) pairwise forces, Barnes-Hut approximates groups of bodies that are far enough away by their center of mass and total mass, reducing the per-step cost to roughly O(N log N).

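For reference, the decision of when a group is "far enough away" is usually made with the standard Barnes-Hut acceptance test; the small sketch below illustrates it and is not taken from the project's code.

#include <stdio.h>

/* Standard Barnes-Hut opening criterion: a cell of width s whose centre of
   mass lies at distance d from the target body may be treated as a single
   point mass when s / d < theta (smaller theta: more accuracy, more work). */
static int can_approximate(double s, double d, double theta) {
    return s / d < theta;
}

int main(void) {
    /* a 10-unit-wide cell 40 units away, tested with theta = 0.35 */
    printf("%s\n", can_approximate(10.0, 40.0, 0.35) ? "approximate" : "recurse");
    return 0;
}
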
Implementation Challenges:

Parallelization: Efficiently dividing the workload across multiple MPI workers (one possible partitioning scheme is sketched after this list).

Communication Overhead: Managing the trade-off between parallel computation benefits and MPI communication overhead.

Optimizations: Implementing performance optimizations to reduce computational complexity.

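One simple way to divide the bodies, shown here only as an illustrative sketch and not the project's actual scheme, is a block decomposition in which the first N % nprocs ranks receive one extra body; the resulting counts and displacements have exactly the shape MPI_Allgatherv expects.

#include <stdio.h>

/* Illustrative block decomposition of N bodies over nprocs ranks. */
static void block_partition(int N, int nprocs, int *counts, int *displs) {
    for (int r = 0, offset = 0; r < nprocs; r++) {
        counts[r] = N / nprocs + (r < N % nprocs ? 1 : 0); /* bodies owned by rank r       */
        displs[r] = offset;                                /* index of rank r's first body */
        offset += counts[r];
    }
}

int main(void) {
    int counts[3], displs[3];
    block_partition(10, 3, counts, displs);                /* e.g. 10 bodies on 3 ranks */
    for (int r = 0; r < 3; r++)
        printf("rank %d: %d bodies starting at index %d\n", r, counts[r], displs[r]);
    return 0;
}
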
Approach

The Barnes-Hut algorithm was implemented in two phases:

Program Flow

The program follows this structure:

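The concrete step list is not reproduced here, so the sketch below shows only what a typical MPI Barnes-Hut main loop could look like; the flat double array, the constants, and the MPI_Allgather exchange are assumptions for illustration, not the project's actual code (which uses the struct Body shown next).

#include <mpi.h>
#include <stdlib.h>

#define N      1024   /* total number of bodies (example value)  */
#define STEPS  1000   /* simulation steps (illustrative)         */
#define FIELDS 5      /* px, py, vx, vy, mass packed per body    */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int local_n = N / nprocs;         /* assumes nprocs divides N; see the
                                         partitioning sketch above otherwise */
    double *all   = calloc((size_t)N * FIELDS, sizeof *all);
    double *local = all + (size_t)rank * local_n * FIELDS;   /* this rank's block */

    /* rank 0 would read the input file into all[] here */
    MPI_Bcast(all, N * FIELDS, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int step = 0; step < STEPS; step++) {
        /* 1. build the quadtree over all N bodies                     */
        /* 2. compute forces for the local_n bodies starting at local  */
        /* 3. integrate positions and velocities of the local block    */

        /* 4. exchange the updated blocks so every rank again holds the
              full, consistent body array for the next step            */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      all, local_n * FIELDS, MPI_DOUBLE, MPI_COMM_WORLD);
    }

    free(all);
    MPI_Finalize();
    return 0;
}

In a real run, rank 0 would also write the final state to the output file; input parsing and error handling are omitted from this sketch.
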
Key Data Structures:

struct Body {
    int index;          // Body identifier
    double px, py;      // Position
    double mass;        // Mass
    double vx, vy;      // Velocity
    double fx, fy;      // Force (initialized to 0)
};

struct Node {
    struct Body* b;                  // Body stored in this node
    struct Node *NW, *NE, *SW, *SE;  // Child quadrants
    double minx, maxx, miny, maxy;   // Bounds of this quadrant
};
    
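As an illustration of how these structures could drive the force phase, here is a minimal sketch of the recursive Barnes-Hut traversal. It assumes, beyond what is shown above, that each internal node's b points to an aggregated pseudo-body holding the subtree's total mass and centre of mass, and it uses a textbook gravitational kernel with a small softening term; the project's actual force code may differ.

#include <math.h>
#include <stddef.h>

#define G 6.674e-11   /* gravitational constant (the project may use a scaled value) */

/* Accumulate the force exerted by bj (a real body or a pseudo-body) on bi. */
static void add_force(struct Body *bi, const struct Body *bj) {
    double dx = bj->px - bi->px, dy = bj->py - bi->py;
    double d  = sqrt(dx * dx + dy * dy) + 1e-9;          /* softening avoids divide-by-zero */
    double f  = G * bi->mass * bj->mass / (d * d);
    bi->fx += f * dx / d;
    bi->fy += f * dy / d;
}

/* Recursive traversal: approximate far-away subtrees, recurse otherwise. */
static void compute_force(const struct Node *n, struct Body *bi, double theta) {
    if (n == NULL || n->b == NULL || n->b == bi)
        return;
    int is_leaf = (n->NW == NULL && n->NE == NULL &&
                   n->SW == NULL && n->SE == NULL);
    double s  = n->maxx - n->minx;                       /* width of this quadrant */
    double dx = n->b->px - bi->px, dy = n->b->py - bi->py;
    double d  = sqrt(dx * dx + dy * dy);
    if (is_leaf || s / d < theta) {
        add_force(bi, n->b);                             /* treat subtree as one body */
    } else {
        compute_force(n->NW, bi, theta);
        compute_force(n->NE, bi, theta);
        compute_force(n->SW, bi, theta);
        compute_force(n->SE, bi, theta);
    }
}

Each simulation step, compute_force would be called once per locally owned body, with theta presumably supplied by the -t option (0.35 in the example run below), and positions then advanced from the accumulated fx, fy.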

Architecture & System Details

The implementation was tested on a system with the following specifications:

System Architecture

Architecture:         x86_64
CPU(s):               8
Thread(s) per core:   2
Core(s) per socket:   4
Socket(s):            1
Model name:           Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80 GHz

MPI Execution Details:

To run the program with multiple MPI workers:

HWLOC_COMPONENTS=-gl mpirun -n [number_of_workers] --oversubscribe ./nbody -i input/nb-10.txt -o output/output.txt -s 1000 -t 0.35 -d 0.005

Here --oversubscribe allows Open MPI to launch more ranks than there are cores, and HWLOC_COMPONENTS=-gl excludes hwloc's GL (display-probing) component, avoiding spurious warnings on headless machines. The remaining flags select the input and output files and appear to set the number of simulation steps (-s), the Barnes-Hut threshold θ (-t), and the timestep (-d).

Key Findings & Optimizations

Performance Patterns:

Optimizations Implemented:

Conclusion

This project successfully implemented a Barnes-Hut N-body simulation using MPI for parallelization. The results demonstrate the trade-offs between parallelization benefits and communication overhead in distributed computing environments. Optimal performance was achieved with 2-8 workers for medium-sized datasets and 4-8 workers for large datasets, closely matching the number of available physical cores on the test system.