Commit 9ee2ba5

Author: tiffanychu90
add exercise 9 feedback
1 parent 23561fe, commit 9ee2ba5

1 file changed

+39
-1
lines changed

shweta-folder/feedback.md

@@ -281,4 +281,42 @@ def make_chart(df, x_col, y_col, colorscale):
)
```

## Exercise 9
* Whenever `geometry` shows up as a weird hash-like string after `pd.read_parquet()`, the file holds geospatial data, so read it with `gpd.read_parquet()` instead. The geometry column then comes in ready to use, and you won't have to build it yourself with `shapely`!
* Use `index=['feed_key', 'trip_id']` here, because `trip_id` is not unique on its own: `pivot_max = merge.pivot_table(index=['feed_key', 'trip_id'], values='stop_sequence', aggfunc='max').reset_index()`
* These left merges keep every stop sequence, when I think you only want to attach the point geometry for the max or min stop sequence.
```
# also use feed_key within merge columns
max_geom = pivot_max.merge(
    merge[['feed_key', 'trip_id', 'geometry', 'stop_sequence']],
    on=['feed_key', 'trip_id', 'stop_sequence'],
    how='left'
)

# same for min_geom
```
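To see why `feed_key` belongs in both the pivot index and the merge keys, here is a minimal sketch on made-up data. The toy `merge` frame, its values, and the strings standing in for point geometries are assumptions for illustration, not the exercise data.

```python
import pandas as pd

# Hypothetical stop-level table: the same trip_id appears in two feeds.
# Strings stand in for point geometries to keep the sketch pandas-only.
merge = pd.DataFrame({
    "feed_key": ["a", "a", "a", "b", "b"],
    "trip_id": ["t1", "t1", "t1", "t1", "t1"],
    "stop_sequence": [1, 2, 3, 1, 2],
    "geometry": ["A1", "A2", "A3", "B1", "B2"],
})

# Grouping on both keys gives one max per (feed_key, trip_id).
pivot_max = merge.pivot_table(
    index=["feed_key", "trip_id"],
    values="stop_sequence",
    aggfunc="max",
).reset_index()

# Merging on stop_sequence as well attaches only the last stop's geometry.
max_geom = pivot_max.merge(
    merge[["feed_key", "trip_id", "geometry", "stop_sequence"]],
    on=["feed_key", "trip_id", "stop_sequence"],
    how="left",
)
```

In this toy data, pivoting on `trip_id` alone would collapse both feeds into a single row with a max of 3, and feed `b` (whose last stop is sequence 2) would find no match in the merge.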
* A cleaner way to do the distance calculation: rename the `stop_sequence` column to `min_stop_sequence` or `max_stop_sequence` once `min_geom` and `max_geom` are created.
* Merge the two with an inner merge before calculating distance. Unless you merge, there's no guarantee that rows are in the same order in the 2 dfs. Also, distance can only be calculated when both points of a pair are present (if one is missing, you won't want that row anyway!)
* Right now, since `min_geom` and `max_geom` come from left merges, are they keeping too many rows? Also, the lengths of `min_geom` and `max_geom` don't match, so a row-by-row distance calculation isn't guaranteed to pair the right points.
```
# Only pairs of points can have distance calculated
gdf = pd.merge(
    min_geom,
    max_geom,
    on=['feed_key', 'trip_id'],
    how='inner'
)

# Both inputs have a `geometry` column, so the merge suffixes them
# as geometry_x (from min_geom) and geometry_y (from max_geom).
# This series matches pairs row-wise exactly.
distance_col = gdf.geometry_x.distance(gdf.geometry_y)

# You can assign this series to the gdf safely.
# Or, just use assign and create it here to begin with.
gdf = gdf.assign(
    distance=distance_col
)

# By the end of this, the distance is already for each trip
# since the merge produces a trip-level df
```
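One way to check the row-count concern before trusting the distance column is to compare lengths and let pandas verify that the merge is one-to-one. This is a sketch on made-up trip-level frames: the data, the `_min`/`_max` suffixes, and the strings standing in for geometries are assumptions, while `validate` is a standard `pd.merge` argument.

```python
import pandas as pd

# Hypothetical trip-level frames; strings stand in for point geometries.
min_geom = pd.DataFrame({
    "feed_key": ["a", "a", "b"],
    "trip_id": ["t1", "t2", "t1"],
    "geometry": ["A1", "A4", "B1"],
})
max_geom = pd.DataFrame({
    "feed_key": ["a", "b"],
    "trip_id": ["t1", "t1"],
    "geometry": ["A3", "B2"],
})

# Lengths differ, so row-by-row pairing without a merge would misalign.
print(len(min_geom), len(max_geom))

# validate="1:1" raises pandas.errors.MergeError if either side has
# duplicate (feed_key, trip_id) keys, i.e. is not truly trip-level.
pairs = pd.merge(
    min_geom,
    max_geom,
    on=["feed_key", "trip_id"],
    how="inner",
    validate="1:1",
    suffixes=("_min", "_max"),
)
```

Explicit suffixes are optional, but they make it easier to tell which geometry came from which frame than the default `_x` and `_y`.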
* For the shortest distance, I don't see a step that takes the `min()` over any grouping. Distance is calculated from each stop to the highway, and a bus trip makes many stops. Those distances could all be different (10 ft, 100 ft, 1,000 ft, etc.), so dropping duplicates removes repeated values but doesn't necessarily find the minimum.
* For each trip, find the minimum stop distance to the highway. That gives a trip-level df with a `shortest_distance_hwy` column; merge it with your earlier trip-level df that has `distance_first_last_stop`, so the 2 columns sit side by side.
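The last two points could be sketched as follows. The frame names `stops_to_hwy` and `trip_df`, the column `distance_to_hwy`, and all values are hypothetical; only `shortest_distance_hwy` and `distance_first_last_stop` come from the feedback above.

```python
import pandas as pd

# Hypothetical stop-level distances to the highway (feet).
stops_to_hwy = pd.DataFrame({
    "feed_key": ["a", "a", "a", "b", "b"],
    "trip_id": ["t1", "t1", "t1", "t1", "t1"],
    "distance_to_hwy": [1000.0, 10.0, 100.0, 250.0, 75.0],
})

# Take the minimum over each trip instead of dropping duplicates.
shortest = (
    stops_to_hwy
    .groupby(["feed_key", "trip_id"])
    .agg(shortest_distance_hwy=("distance_to_hwy", "min"))
    .reset_index()
)

# Hypothetical trip-level frame from the first/last stop step.
trip_df = pd.DataFrame({
    "feed_key": ["a", "b"],
    "trip_id": ["t1", "t1"],
    "distance_first_last_stop": [5280.0, 2640.0],
})

# Both frames are trip-level, so the two columns line up side by side.
final = trip_df.merge(shortest, on=["feed_key", "trip_id"], how="inner")
```

Grouping on both `feed_key` and `trip_id` keeps the result at the same grain as the first/last stop df, which is what makes the final merge one row per trip.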
