diff --git a/README.md b/README.md index ee39093..1d71e9a 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,57 @@ **University of Pennsylvania, CIS 5650: GPU Programming and Architecture, Project 1 - Flocking** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* ANNIE QIU + * [LinkedIn](https://github.com/AnnieQiuuu/Project0-Getting-Started/blob/main/www.linkedin.com/in/annie-qiu-30531921a) +* Tested on: Windows 11, i9-12900H @ 2500 MHz, 16GB, RTX 3070 Ti 8GB (Personal) -### (TODO: Your README) +## Screenshots +### Screenshot 1 +- Coherent simulation; number of boids = 10000; block size = 128 +![](images/Coherent.gif) -Include screenshots, analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +### Screenshot 2 +- Coherent simulation; number of boids = 500000; block size = 128 +![](images/Coherent2.gif) + +## Performance Analysis + +### Framerate change with increasing # of boids +![](images/BoidsNum.png) +- Description: The dashed lines represent performance without visualization, and the solid lines represent performance with visualization. The y-axis is frames per second (FPS), and the x-axis is the number of boids. Performance decreases for all three implementations as the number of boids increases. The naive simulation is affected the most, while the scattered and coherent simulations degrade more gradually. +- Number of Boids: 1000, 5000, 10000, 20000, 50000 +- Block size: 128 +- Neighbor check: 8 neighbors + +### Framerate change with increasing block size +![](images/BlockSize.png) +- Description: Because the naive simulation is so computationally intensive, even an increased block size yields little improvement in its performance. The y-axis is frames per second (FPS), and the x-axis is the block size. The scattered and coherent simulations reach peak performance at block sizes of 64 and 128 and stay stable after that. 
+- Number of Boids: 20000 +- Block size: 8, 16, 32, 64, 128, 256 and 512 +- Neighbor check: 8 neighbors + +### Framerate change with 8 vs. 27 neighbors +![](images/Cells.png) +- Description: The dashed lines represent the performance of checking 27 neighbors, and the solid lines represent the performance of checking 8 neighbors. As the number of boids gets very large, the performance of checking 27 neighbors decreases. The y-axis is frames per second (FPS), and the x-axis is the number of boids. +- Number of Boids: 1000, 5000, 10000, 20000, 50000 +- Block size: 128 +- Neighbor check: 8 neighbors, 27 neighbors + +### Answers Based on the Analysis +1. For each implementation, how does changing the number of boids affect performance? Why do you think this is? + - For all three implementations, FPS drops as the number of boids increases, reducing performance. + - Naive simulation: This is the slowest of the three implementations, and its FPS drops drastically as the number of boids increases. The naive simulation loops over every other boid for each boid, giving O(N^2) complexity, so it slows down significantly as N grows. + - Scattered simulation: By using a uniform grid, the number of boids each boid has to check is reduced, so the performance drop is less severe than in the naive implementation. However, when the number of boids becomes very large, performance still decreases significantly because every loop iteration has to reach the unsorted boid data through an extra index array. + - Coherent simulation: This is an optimized version of the scattered implementation. By sorting the position and velocity arrays, memory access time is reduced. Although the number of boids still affects performance, this version performs best overall, with a slower rate of decline than the other two implementations. +2. For each implementation, how does changing the block count and block size affect performance? Why do you think this is? 
+ - The low performance at block sizes smaller than 32 comes from not having enough threads to fill a warp. If the block size is a multiple of 32, the warps run efficiently in parallel. As the block size increases past a certain point, such as 64 or 128, performance stabilizes because the GPU's resources reach their limit. While larger block sizes mean more threads per block, GPU resources such as registers and shared memory are limited, so I think further increasing the block size does not lead to additional performance gains. + - Naive simulation: Since this implementation is highly computationally intensive, even an increased block size may not improve performance much. + - Scattered simulation: Performance peaks at block sizes of 32 and 64, then stabilizes as the block size continues to increase. It is not as efficient as coherent, but much better than naive. + - Coherent simulation: It has the best performance of all the simulations. Performance peaks at block sizes of 64 or 128, and after that it stabilizes like the scattered simulation. +3. For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not? + - Yes, and this is what I expected. In the coherent uniform grid, the boid data is sorted by grid cell, which lets us check neighbors more efficiently. The most time-consuming part of the simulation is memory access. By sorting the boid data, I no longer need to go through particleArrayIndices on every access during the simulation. As the number of boids increases, this coherent global memory access pattern becomes more important. Therefore, as the number of boids increases, the performance of the coherent implementation declines more slowly than the scattered one. 
+ My analysis screenshot shows that at 50000 boids, the framerate of the scattered version falls much more sharply than that of the coherent version. +4. Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check! + - Check 27 neighbors: The grid cells are smaller, so each cell holds fewer boids. However, checking 27 neighbors also increases the number of cell lookups and memory accesses, which can decrease performance. + - Check 8 neighbors: The grid cells are larger, so each cell can hold more boids. However, each boid only needs to check 8 cells, so the number of cell lookups is reduced. + - In summary, when the number of boids gets very large, checking 27 cells may be slower due to the increased memory accesses and computation, while in a densely populated scene checking 8 cells can be slower because each larger cell contains more boids to process. 
\ No newline at end of file diff --git a/images/BlockSize.png b/images/BlockSize.png new file mode 100644 index 0000000..a7a9b7e Binary files /dev/null and b/images/BlockSize.png differ diff --git a/images/BoidsNum.png b/images/BoidsNum.png new file mode 100644 index 0000000..2fededc Binary files /dev/null and b/images/BoidsNum.png differ diff --git a/images/Cells.png b/images/Cells.png new file mode 100644 index 0000000..69d1bef Binary files /dev/null and b/images/Cells.png differ diff --git a/images/Coherent.gif b/images/Coherent.gif new file mode 100644 index 0000000..9347da8 Binary files /dev/null and b/images/Coherent.gif differ diff --git a/images/Coherent2.gif b/images/Coherent2.gif new file mode 100644 index 0000000..a1ebbde Binary files /dev/null and b/images/Coherent2.gif differ diff --git a/src/kernel.cu b/src/kernel.cu index 74dffcb..49139f1 100644 --- a/src/kernel.cu +++ b/src/kernel.cu @@ -17,6 +17,10 @@ #define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__) +//add a flag to check 27 neighbors +#define check27neighbors 0 +#define check8neighbors 1 + /** * Check for CUDA errors; print and exit if there was a problem. */ @@ -85,6 +89,8 @@ int *dev_gridCellEndIndices; // to this cell? // TODO-2.3 - consider what additional buffers you might need to reshuffle // the position and velocity data to be coherent within cells. +glm::vec3* dev_coherentPos; +glm::vec3* dev_coherentVel1; // LOOK-2.1 - Grid parameters based on simulation parameters. 
// These are automatically computed for you in Boids::initSimulation @@ -157,7 +163,14 @@ void Boids::initSimulation(int N) { checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); // LOOK-2.1 computing grid params +// Change the cell width of the uniform grid to be the neighborhood distance +// Now, 27 neighboring cells will need to be checked for intersection +#if check27neighbors + gridCellWidth = std::max(std::max(rule1Distance, rule2Distance), rule3Distance); +#endif +#if check8neighbors gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); +#endif int halfSideCount = (int)(scene_scale / gridCellWidth) + 1; gridSideCount = 2 * halfSideCount; @@ -169,6 +182,34 @@ void Boids::initSimulation(int N) { gridMinimum.z -= halfGridWidth; // TODO-2.1 TODO-2.3 - Allocate additional buffers here. + + // We have N boids, so we need N indices + cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!"); + + cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!"); + + // We use gridCellCount because we need to know the start and end indices of each cell + cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!"); + + cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!"); + + // Coherent buffers + cudaMalloc((void**)&dev_coherentPos, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_coherentPos failed!"); + + cudaMalloc((void**)&dev_coherentVel1, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_coherentVel1 failed!"); + + //cudaMalloc((void**)&dev_coherentVel2, N * sizeof(glm::vec3)); + //checkCUDAErrorWithLine("cudaMalloc dev_coherentVel2 failed!"); + + // thrust pointers + 
dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices); + dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices); cudaDeviceSynchronize(); } @@ -233,7 +274,51 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves // Rule 2: boids try to stay a distance d away from each other // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); + + // Rule accumulators + glm::vec3 perceivedCenter(0.0f, 0.0f, 0.0f); + glm::vec3 c(0.0f, 0.0f, 0.0f); + glm::vec3 perceivedVelocity(0.0f, 0.0f, 0.0f); + glm::vec3 newVel(0.f); + int rule1Neighb = 0; + int rule3Neighb = 0; + + glm::vec3 currentPos = pos[iSelf]; + + // loop through all boids + for(int i = 0; i < N; ++i) { + if(i != iSelf){ + float dist = glm::distance(pos[i], currentPos); + if (dist < rule1Distance) { + perceivedCenter += pos[i]; + rule1Neighb++; + } + if (dist < rule2Distance) { + c -= (pos[i] - currentPos); + } + if (dist < rule3Distance) { + perceivedVelocity += vel[i]; + rule3Neighb++; + } + } + } + + // prevent division by 0 + if (rule1Neighb > 0) { + perceivedCenter /= rule1Neighb; + newVel += (perceivedCenter - currentPos) * rule1Scale; + } + newVel += c * rule2Scale; + if (rule3Neighb > 0) { + perceivedVelocity /= rule3Neighb; + newVel += perceivedVelocity * rule3Scale; + } + + // return only the velocity change; the caller adds the current velocity, + // so returning newVel + vel[iSelf] here would count it twice + return newVel; } /** @@ -242,9 +327,23 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po */ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { + // Compute the index + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + // Compute a new velocity based on pos and vel1 + glm::vec3 newVel = vel1[index] + computeVelocityChange(N, 
index, pos, vel1); + // Clamp the speed - // Record the new velocity into vel2. Question: why NOT vel1? + if(glm::length(newVel) > maxSpeed) { + newVel = glm::normalize(newVel) * maxSpeed; + } + + // Record the new velocity into vel2. + // Question: why NOT vel1? Answer: other threads are still reading vel1 this step, so we write to vel2 and ping-pong the buffers. + vel2[index] = newVel; } /** @@ -278,6 +377,7 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { // for(x) // for(y) // for(z)? Or some other order? +// z, y, x order (x innermost) matches the 1D layout below, so it is the most memory-efficient traversal __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { return x + y * gridResolution + z * gridResolution * gridResolution; } @@ -289,6 +389,22 @@ __global__ void kernComputeIndices(int N, int gridResolution, // - Label each boid with the index of its grid cell. // - Set up a parallel array of integer indices as pointers to the actual // boid data in pos and vel1/vel2 + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + // Store the original boid index to indices + // The original boid index is used to access the boid data + indices[index] = index; + + glm::vec3 boidPos = pos[index]; + int ix = glm::floor((boidPos.x - gridMin.x) * inverseCellWidth); + int iy = glm::floor((boidPos.y - gridMin.y) * inverseCellWidth); + int iz = glm::floor((boidPos.z - gridMin.z) * inverseCellWidth); + int index1D = gridIndex3Dto1D(ix, iy, iz, gridResolution); + + // Store the grid index corresponding to the current boid to gridIndices + gridIndices[index] = index1D; } // LOOK-2.1 Consider how this could be useful for indicating that a cell @@ -306,6 +422,32 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, // Identify the start point of each cell in the gridIndices array. // This is basically a parallel unrolling of a loop that goes // "this index doesn't match the one before it, must be a new cell!" + + // particleGridIndices is already sorted by cell index. 
+ int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + // Get cell index of the current boid + int currentCell = particleGridIndices[index]; + + // The first boid overall always starts its cell + if (index == 0) { + gridCellStartIndices[currentCell] = index; + } + else { + int lastCell = particleGridIndices[index - 1]; + // If the current cell index differs from the previous one, this boid starts a new cell + if (currentCell != lastCell) { + gridCellStartIndices[currentCell] = index; + gridCellEndIndices[lastCell] = index - 1; + } + } + // The last boid overall always ends its cell (this must not be an else-if, + // or a cell containing only boid N - 1 would never get its start recorded) + if (index == N - 1) { + gridCellEndIndices[currentCell] = index; + } } __global__ void kernUpdateVelNeighborSearchScattered( @@ -322,6 +464,113 @@ __global__ void kernUpdateVelNeighborSearchScattered( // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. // - Clamp the speed change before putting the new speed in vel2 + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + // Current boid + glm::vec3 currentBoidPos = pos[index]; + glm::vec3 currentBoidVel = vel1[index]; + + // Simulation variables + glm::vec3 newVel(0.0f); + glm::vec3 perceivedCenter(0.0f); + glm::vec3 c(0.0f); + glm::vec3 perceivedVel(0.0f); + int rule1Neighb = 0; + int rule3Neighb = 0; + + // - Identify the grid cell that this particle is in + int ix = glm::floor((currentBoidPos.x - gridMin.x) * inverseCellWidth); + int iy = glm::floor((currentBoidPos.y - gridMin.y) * inverseCellWidth); + int iz = glm::floor((currentBoidPos.z - gridMin.z) * inverseCellWidth); + + // Fractional position within the cell, measured from the grid origin + glm::vec3 InCellOffset = (currentBoidPos - gridMin) * inverseCellWidth - glm::floor((currentBoidPos - gridMin) * inverseCellWidth); + + // check the boid's position in the cell, and decide the search direction + glm::vec3 searchStart = { + (InCellOffset.x) < 0.5f ? -1 : 0, + (InCellOffset.y) < 0.5f ? -1 : 0, + (InCellOffset.z) < 0.5f ? 
-1 : 0 + }; + + glm::vec3 searchEnd = { + (InCellOffset.x) < 0.5f ? 0 : 1, + (InCellOffset.y) < 0.5f ? 0 : 1, + (InCellOffset.z) < 0.5f ? 0 : 1 + }; + +// check the 3x3x3 cells around the current cell +// In z, y, x order +#if check27neighbors + for (int z = -1; z <= 1; ++z) { + for (int y = -1; y <= 1; ++y) { + for (int x = -1; x <= 1; ++x) { +#endif + +#if check8neighbors + for (int z = searchStart.z; z <= searchEnd.z; ++z) { + for (int y = searchStart.y; y <= searchEnd.y; ++y) { + for (int x = searchStart.x; x <= searchEnd.x; ++x) { +#endif + // Check boundary + int idxX = imax(0, imin(ix + x, gridResolution - 1)); + int idxY = imax(0, imin(iy + y, gridResolution - 1)); + int idxZ = imax(0, imin(iz + z, gridResolution - 1)); + int neighborIndex = gridIndex3Dto1D(idxX, idxY, idxZ, gridResolution); + // Check if the cell is empty + int startBoidIndex = gridCellStartIndices[neighborIndex]; + int endBoidIndex = gridCellEndIndices[neighborIndex]; + // For each cell, read the start/end indices in the boid pointer array. 
+ if (startBoidIndex != -1 && endBoidIndex != -1) { + for (int id = startBoidIndex; id <= endBoidIndex; ++id) { + int neighbBoidIndex = particleArrayIndices[id]; + if (neighbBoidIndex != index) { + glm::vec3 neighbBoidPos = pos[neighbBoidIndex]; + float dist = glm::distance(neighbBoidPos, currentBoidPos); + if (dist < rule1Distance) { + rule1Neighb++; + perceivedCenter += neighbBoidPos; + } + if (dist < rule2Distance) { + c -= (neighbBoidPos - currentBoidPos); + } + if (dist < rule3Distance) { + rule3Neighb++; + perceivedVel += vel1[neighbBoidIndex]; + } + } + } + } + + } + } + } + if (rule1Neighb > 0) { + perceivedCenter /= rule1Neighb; + newVel += (perceivedCenter - pos[index]) * rule1Scale; + } + newVel += c * rule2Scale; + if (rule3Neighb > 0) { + perceivedVel /= rule3Neighb; + newVel += perceivedVel * rule3Scale; + } + newVel += currentBoidVel; + // Clamp the speed change + if (glm::length(newVel) > maxSpeed) { + newVel = glm::normalize(newVel) * maxSpeed; + } + vel2[index] = newVel; +} + + +__global__ void kernSortBuffer(int N, glm::vec3* pos, glm::vec3* vel1, glm::vec3* coherentPos, glm::vec3* coherentVel1, int* particleArrayIndices) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < N) { + int sortedIndex = particleArrayIndices[index]; + coherentPos[index] = pos[sortedIndex]; + coherentVel1[index] = vel1[sortedIndex]; + } } __global__ void kernUpdateVelNeighborSearchCoherent( @@ -341,6 +590,100 @@ __global__ void kernUpdateVelNeighborSearchCoherent( // - Access each boid in the cell and compute velocity change from // the boids rules, if this boid is within the neighborhood distance. 
// - Clamp the speed change before putting the new speed in vel2 + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + // Current boid + glm::vec3 currentBoidPos = pos[index]; + glm::vec3 currentBoidVel = vel1[index]; + + // Simulation variables + glm::vec3 newVel(0.0f); + glm::vec3 perceivedCenter(0.0f); + glm::vec3 c(0.0f); + glm::vec3 perceivedVel(0.0f); + int rule1Neighb = 0; + int rule3Neighb = 0; + + // - Identify the grid cell that this particle is in + int ix = glm::floor((currentBoidPos.x - gridMin.x) * inverseCellWidth); + int iy = glm::floor((currentBoidPos.y - gridMin.y) * inverseCellWidth); + int iz = glm::floor((currentBoidPos.z - gridMin.z) * inverseCellWidth); + + // Fractional position within the cell, measured from the grid origin + glm::vec3 InCellOffset = (currentBoidPos - gridMin) * inverseCellWidth - glm::floor((currentBoidPos - gridMin) * inverseCellWidth); + + // check the boid's position in the cell, and decide the search direction + glm::vec3 searchStart = { + (InCellOffset.x) < 0.5f ? -1 : 0, + (InCellOffset.y) < 0.5f ? -1 : 0, + (InCellOffset.z) < 0.5f ? -1 : 0 + }; + + glm::vec3 searchEnd = { + (InCellOffset.x) < 0.5f ? 0 : 1, + (InCellOffset.y) < 0.5f ? 0 : 1, + (InCellOffset.z) < 0.5f ? 
0 : 1 + }; +#if check27neighbors + for (int z = -1; z <= 1; ++z) { + for (int y = -1; y <= 1; ++y) { + for (int x = -1; x <= 1; ++x) { +#endif + +#if check8neighbors + //check8neighbors + for (int z = searchStart.z; z <= searchEnd.z; ++z) { + for (int y = searchStart.y; y <= searchEnd.y; ++y) { + for (int x = searchStart.x; x <= searchEnd.x; ++x) { +#endif + // Check boundary + int idxX = imax(0, imin(ix + x, gridResolution - 1)); + int idxY = imax(0, imin(iy + y, gridResolution - 1)); + int idxZ = imax(0, imin(iz + z, gridResolution - 1)); + int neighborIndex = gridIndex3Dto1D(idxX, idxY, idxZ, gridResolution); + // Check if the cell is empty + int startBoidIndex = gridCellStartIndices[neighborIndex]; + int endBoidIndex = gridCellEndIndices[neighborIndex]; + // For each cell, read the start/end indices in the boid pointer array. + if (startBoidIndex != -1 && endBoidIndex != -1) { + for (int id = startBoidIndex; id <= endBoidIndex; ++id) { + if (id != index) { + glm::vec3 neighbBoidPos = pos[id]; + float dist = glm::distance(neighbBoidPos, currentBoidPos); + if (dist < rule1Distance) { + rule1Neighb++; + perceivedCenter += neighbBoidPos; + } + if (dist < rule2Distance) { + c -= (neighbBoidPos - currentBoidPos); + } + if (dist < rule3Distance) { + rule3Neighb++; + perceivedVel += vel1[id]; + } + } + } + } + } + } + } + if (rule1Neighb > 0) { + perceivedCenter /= rule1Neighb; + newVel += (perceivedCenter - pos[index]) * rule1Scale; + } + newVel += c * rule2Scale; + if (rule3Neighb > 0) { + perceivedVel /= rule3Neighb; + newVel += perceivedVel * rule3Scale; + } + newVel += currentBoidVel; + // Clamp the speed change + if (glm::length(newVel) > maxSpeed) { + newVel = glm::normalize(newVel) * maxSpeed; + } + vel2[index] = newVel; } /** @@ -349,6 +692,18 @@ __global__ void kernUpdateVelNeighborSearchCoherent( void Boids::stepSimulationNaive(float dt) { // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. 
// TODO-1.2 ping-pong the velocity buffers + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + // Update the velocity to vel2 + kernUpdateVelocityBruteForce <<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2); + checkCUDAErrorWithLine("kernUpdateVelocityBruteForce failed"); + + // Use vel2 to update the position + kernUpdatePos <<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2); + checkCUDAErrorWithLine("kernUpdatePos failed"); + + // Swap the vel1 and vel2 buffers + std::swap(dev_vel1, dev_vel2); + } void Boids::stepSimulationScatteredGrid(float dt) { @@ -364,6 +719,31 @@ void Boids::stepSimulationScatteredGrid(float dt) { // - Perform velocity updates using neighbor search // - Update positions // - Ping-pong buffers as needed + + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + dim3 fullCellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize); + + // Compute grid indices + kernComputeIndices <<<fullBlocksPerGrid, blockSize>>> (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + checkCUDAErrorWithLine("kernComputeIndices failed"); + + // Sort the [grid cell index, boid index] pairs by grid cell index + thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices); + // Set the start and end indices of each cell to -1 (empty cells stay -1) + kernResetIntBuffer <<<fullCellBlocksPerGrid, blockSize>>> (gridCellCount, dev_gridCellStartIndices, -1); + kernResetIntBuffer <<<fullCellBlocksPerGrid, blockSize>>> (gridCellCount, dev_gridCellEndIndices, -1); + kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + + // Update the velocity to vel2 + kernUpdateVelNeighborSearchScattered <<<fullBlocksPerGrid, blockSize>>> (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2); + checkCUDAErrorWithLine("Synchronize 
kernUpdateVelNeighborSearchScattered failed"); + + // Use vel2 to update the position + kernUpdatePos <<<fullBlocksPerGrid, blockSize>>> (numObjects, dt, dev_pos, dev_vel2); + checkCUDAErrorWithLine("kernUpdatePos failed"); + // Swap the vel1 and vel2 buffers + std::swap(dev_vel1, dev_vel2); + } void Boids::stepSimulationCoherentGrid(float dt) { @@ -382,6 +762,36 @@ void Boids::stepSimulationCoherentGrid(float dt) { // - Perform velocity updates using neighbor search // - Update positions // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE. + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + dim3 fullCellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize); + + // Compute grid indices + kernComputeIndices <<<fullBlocksPerGrid, blockSize>>> (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices); + checkCUDAErrorWithLine("kernComputeIndices failed"); + + thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices); + + // Set the start and end indices of each cell to -1 (empty cells stay -1) + kernResetIntBuffer <<<fullCellBlocksPerGrid, blockSize>>> (gridCellCount, dev_gridCellStartIndices, -1); + kernResetIntBuffer <<<fullCellBlocksPerGrid, blockSize>>> (gridCellCount, dev_gridCellEndIndices, -1); + kernIdentifyCellStartEnd <<<fullBlocksPerGrid, blockSize>>> (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + + // use the rearranged array index buffer to reshuffle all the particle data in the simulation array. 
+ kernSortBuffer <<<fullBlocksPerGrid, blockSize>>> (numObjects, dev_pos, dev_vel1, dev_coherentPos, dev_coherentVel1, dev_particleArrayIndices); + checkCUDAErrorWithLine("kernSortBuffer failed"); + + // Update the velocity to vel2 + kernUpdateVelNeighborSearchCoherent <<<fullBlocksPerGrid, blockSize>>> (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_coherentPos, dev_coherentVel1, dev_vel2); + checkCUDAErrorWithLine("kernUpdateVelNeighborSearchCoherent failed"); + + // Use vel2 to update the position + kernUpdatePos <<<fullBlocksPerGrid, blockSize>>> (numObjects, dt, dev_coherentPos, dev_vel2); + checkCUDAErrorWithLine("kernUpdatePos failed"); + + // Swap buffers + std::swap(dev_pos, dev_coherentPos); + std::swap(dev_vel1, dev_vel2); + } void Boids::endSimulation() { @@ -390,6 +800,12 @@ void Boids::endSimulation() { cudaFree(dev_pos); // TODO-2.1 TODO-2.3 - Free any additional buffers here. + cudaFree(dev_gridCellEndIndices); + cudaFree(dev_gridCellStartIndices); + cudaFree(dev_particleGridIndices); + cudaFree(dev_particleArrayIndices); + cudaFree(dev_coherentPos); + cudaFree(dev_coherentVel1); + } void Boids::unitTest() { diff --git a/src/main.cpp b/src/main.cpp index fe657ed..121918f 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -17,8 +17,8 @@ // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID #define VISUALIZE 1 -#define UNIFORM_GRID 0 -#define COHERENT_GRID 0 +#define UNIFORM_GRID 1 +#define COHERENT_GRID 1 // LOOK-1.2 - change this to adjust particle count in the simulation const int N_FOR_VIS = 5000; @@ -220,6 +220,10 @@ void initShaders(GLuint * program) { double timebase = 0; int frame = 0; + // For average fps calculation + double totalFPS = 0.0; + int framesCount = 0; + Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure // your CUDA development setup is ready to go. 
@@ -233,6 +237,8 @@ void initShaders(GLuint * program) { fps = frame / (time - timebase); timebase = time; frame = 0; + totalFPS += fps; + framesCount++; } runCUDA(); @@ -259,6 +265,8 @@ void initShaders(GLuint * program) { glfwSwapBuffers(window); #endif } + // Guard against division by zero if the loop exits before one FPS sample is taken + double averageFPS = (framesCount > 0) ? totalFPS / framesCount : 0.0; + std::cout << "Average FPS: " << averageFPS << std::endl; glfwDestroyWindow(window); glfwTerminate(); }