Commit ea886a5

Merge pull request #108 from CDRH/feature/kw_normalize
Feature/kw normalize
2 parents ab7d050 + 4b92510 commit ea886a5

11 files changed: +148 -56 lines changed

CHANGELOG.md

Lines changed: 33 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -2,9 +2,12 @@
22

33
All notable changes to Apium will be documented in this file.
44

5-
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6-
and this project adheres to [Semantic
7-
Versioning](https://semver.org/spec/v2.0.0.html).
5+
Starting from Apium v1.0.1, the format is based on [Keep a
6+
Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to
7+
[Semantic Versioning](https://semver.org/spec/v2.0.0.html).
8+
9+
Please respect the 80-character text margin and follow the [GitHub Flavored
10+
Markdown Spec](https://github.github.com/gfm/).
811

912
<!-- Template - Please preserve this order of sections
1013
## [Unreleased] - Brief description
@@ -28,8 +31,30 @@ Versioning](https://semver.org/spec/v2.0.0.html).
2831
## [Unreleased] - updates in preparation for Habeas release
2932
[Unreleased]: https://github.com/CDRH/api/compare/v1.0.4...dev
3033

34+
### Added
35+
- "api_version" added to all response "res" objects
36+
3137
### Changed
3238
- upgraded to Rails 6
39+
- Added support for aggregating buckets by normalized keyword and returning
40+
the "top_hits" first document result for a non-normalized display
41+
- Changes response format of `facets` key
42+
43+
From:
44+
```
45+
"facets": {
46+
"WILLA CATHER": 10,
47+
"Willa Cather": 50
48+
}
49+
```
50+
To:
51+
```
52+
"facets": {
53+
"willa cather": { "num" : 60, source: "Willa Cather" }
54+
}
55+
```
56+
Not only is the response format itself different, but there may be fewer
57+
facets returned, since normalized values which match are combined.
3358

3459
## [v1.0.4](https://github.com/CDRH/api/compare/v1.0....v1.0.4) - Updates & license
3560

@@ -38,7 +63,6 @@ Versioning](https://semver.org/spec/v2.0.0.html).
3863
license added
3964

4065
### Added
41-
4266
- Documentation on facets and highlighting
4367

4468
## [v1.0.3](https://github.com/CDRH/api/compare/v1.0.2...v1.0.3) - gem updates
@@ -67,3 +91,8 @@ license added
6791
- version moved to initializer
6892

6993
## [v1.0.0](https://github.com/CDRH/api/tree/v1.0.0) - Initial Launch
94+
95+
### Contributors
96+
97+
- Jessica Dussault (jduss4)
98+
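
The `facets` format change described in this changelog affects any client that reads facet counts. As a rough sketch only, a consumer could accept both shapes along the following lines. `facet_counts` is a hypothetical helper, not part of Apium, and the sample hashes simply mirror the "From"/"To" example in the changelog.

```ruby
# Hypothetical client-side helper (not part of Apium): builds a
# display-label => count hash from one field's facet values,
# accepting either the old or the new response shape.
def facet_counts(field_facets)
  field_facets.each_with_object({}) do |(key, value), counts|
    if value.is_a?(Hash)
      # new format: "willa cather" => { "num" => 60, "source" => "Willa Cather" }
      counts[value["source"] || key] = value["num"]
    else
      # old format: "Willa Cather" => 50
      counts[key] = value
    end
  end
end

# Sample data copied from the changelog example
old_facets = { "WILLA CATHER" => 10, "Willa Cather" => 50 }
new_facets = { "willa cather" => { "num" => 60, "source" => "Willa Cather" } }

p facet_counts(old_facets)
p facet_counts(new_facets)
```

With the new format the two spellings collapse into a single entry, so a client should expect fewer keys per field than before.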

README.md

Lines changed: 2 additions & 1 deletion
@@ -1,7 +1,8 @@
11
# Apium
22

3-
Apium is an API to access all public Center for Digital Research in the Humanities resources. It is also an invasive weed in Nebraska.
3+
Apium is an API to access all public Center for Digital Research in the Humanities resources. It is also a genus of plants which includes celery, fool's water cress, and lesser marshwort.
44

55
**[Apium Documentation](docs/README.md)**
6+
**[Changelog](CHANGELOG.md)**
67

78
This project is licensed under the terms of the [MIT license](LICENSE.md).

app/controllers/application_controller.rb

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ def display_error(error, req_body)
1717
render(status: 500, json: JSON.pretty_generate({
1818
"res" => {
1919
"code" => 500,
20+
"api_version" => Api::Application::VERSION,
2021
"message" => "TODO",
2122
"info" => {
2223
"documentation" => "TODO",

app/controllers/collection_controller.rb

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ def show
1515
"query_string" => request.fullpath
1616
},
1717
"res" => {
18+
"api_version" => Api::Application::VERSION,
1819
"code" => 200,
1920
"info" => {
2021
"collection" => {},

app/controllers/default_controller.rb

Lines changed: 0 additions & 1 deletion
@@ -26,7 +26,6 @@ def root
2626
"index_updated" => "TODO",
2727
"license" => METADATA["license"],
2828
"terms_of_service" => METADATA["terms_of_service"],
29-
"version" => Api::Application::VERSION,
3029
# TODO should we be obtaining these from
3130
# Rails.application.routes or similar?
3231
"endpoints" => [

app/services/search_coll_res.rb

Lines changed: 2 additions & 1 deletion
@@ -16,14 +16,15 @@ def build_response
1616
"collection_name" => coll["key"],
1717
"description" => "TODO",
1818
"image_id" => "TODO",
19+
"api_version" => Api::Application::VERSION,
1920
"uri" => "TODO",
2021
"collection" => coll["key"],
2122
"item_count" => coll["doc_count"],
2223
"endpoint" => route_paths.collection_path(coll["key"])
2324
}
2425
end
2526

26-
return {
27+
{
2728
"code" => 200,
2829
"info" => {
2930
"count" => collections.length,

app/services/search_item_req.rb

Lines changed: 26 additions & 11 deletions
@@ -117,26 +117,41 @@ def facets
117117
"field" => f,
118118
"order" => { type => dir },
119119
"size" => size
120+
},
121+
"aggs" => {
122+
"top_matches" => {
123+
"top_hits" => {
124+
"_source" => {
125+
"includes" => [ f ]
126+
},
127+
"size" => 1
128+
}
129+
}
120130
}
121131
}
122132
}
123133
}
124134
else
125135
aggs[f] = {
126136
"terms" => {
127-
# TODO if dataset is large, can implement partitions?
128-
# "include" => {
129-
# "partition" => 0,
130-
# "num_partitions" => 10
131-
# },
132137
"field" => f,
133138
"order" => { type => dir },
134139
"size" => size
140+
},
141+
"aggs" => {
142+
"top_matches" => {
143+
"top_hits" => {
144+
"_source" => {
145+
"includes" => [ f ]
146+
},
147+
"size" => 1
148+
}
149+
}
135150
}
136151
}
137152
end
138153
end
139-
return aggs
154+
aggs
140155
end
141156

142157
def filters
@@ -206,7 +221,7 @@ def filters
206221
filter_list << { "term" => { filter[0] => filter[1].gsub(/\r/, "") } }
207222
end
208223
end
209-
return filter_list
224+
filter_list
210225
end
211226

212227
def highlights
@@ -228,7 +243,7 @@ def highlights
228243
end
229244
end
230245
end
231-
return hl
246+
hl
232247
end
233248

234249
def sort
@@ -275,7 +290,7 @@ def sort
275290

276291
end
277292

278-
return sort_obj
293+
sort_obj
279294
end
280295

281296
def source
@@ -285,7 +300,7 @@ def source
285300
criteria = {}
286301
criteria["includes"] = wlist if !wlist.empty?
287302
criteria["excludes"] = blist if !blist.empty?
288-
return criteria
303+
criteria
289304
end
290305

291306
def text_search
@@ -309,7 +324,7 @@ def text_search
309324
else
310325
must = { "match_all" => {} }
311326
end
312-
return must
327+
must
313328
end
314329

315330
end
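
The non-nested branch of the `facets` change in this file pairs each `terms` aggregation with a `top_hits` sub-aggregation of size 1. A minimal standalone sketch of that hash follows. The helper name `terms_agg_with_top_hit`, its keyword defaults, and the field name `creator.name` are assumptions for illustration; in the commit the hash is built inline.

```ruby
require "json"

# Hypothetical standalone helper; the commit builds this hash inline
# inside the facets method. Defaults below are illustrative assumptions.
def terms_agg_with_top_hit(field, order_type: "_count", dir: "desc", size: 20)
  {
    # bucket documents by the (normalized) keyword field
    "terms" => {
      "field" => field,
      "order" => { order_type => dir },
      "size" => size
    },
    # for each bucket, fetch one document's original value of the field
    "aggs" => {
      "top_matches" => {
        "top_hits" => {
          "_source" => { "includes" => [field] },
          "size" => 1
        }
      }
    }
  }
end

agg = terms_agg_with_top_hit("creator.name")
puts JSON.pretty_generate(agg)
```

Limiting `_source` to the faceted field keeps each bucket's extra payload small, since only one value is needed to label the bucket.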

app/services/search_item_res.rb

Lines changed: 66 additions & 26 deletions
@@ -4,6 +4,7 @@ class SearchItemRes
44

55
@@count = ["hits", "total"]
66
@@facets = ["aggregations"]
7+
@@facets_label = ["top_matches", "hits", "hits", "_source"]
78
@@item = ["hits", "hits", 0, "_source"]
89
@@items = ["hits", "hits"]
910

@@ -18,9 +19,10 @@ def build_response
1819
items = combine_highlights
1920
facets = reformat_facets
2021

21-
return {
22+
{
2223
"code" => 200,
2324
"count" => count,
25+
"api_version" => Api::Application::VERSION,
2426
"facets" => facets,
2527
"items" => items,
2628
}
@@ -29,45 +31,83 @@ def build_response
2931
def combine_highlights
3032
hits = @body.dig(*@@items)
3133
if hits
32-
return hits.map do |hit|
34+
hits.map do |hit|
3335
hit["_source"]["highlight"] = hit["highlight"] || {}
3436
hit["_source"]
3537
end
3638
else
37-
return []
39+
[]
3840
end
3941
end
4042

43+
def find_source_from_top_hits(top_hits, field, key)
44+
# elasticsearch stores nested source results without the "path"
45+
nested_child = field.split(".").last
46+
hit = top_hits.first.dig("_source", nested_child)
47+
# if this is a multivalued field (for example: works or places),
48+
# ALL of the values come back as the source, but we only want
49+
# the single value from which the key was derived
50+
if hit.class == Array
51+
# I don't love this, because we will have to match exactly the logic
52+
# that got us the key to get this to work
53+
match_index = hit
54+
.map { |s| remove_nonword_chars(s) }
55+
.index(remove_nonword_chars(key))
56+
# if nothing matches the original key, return the entire source hit
57+
# should return a string, regardless
58+
return match_index ? hit[match_index] : hit.join(" ")
59+
else
60+
# it must be single-valued and therefore we are good to go
61+
return hit
62+
end
63+
end
64+
65+
def format_bucket_value(facets, field, bucket)
66+
# dates return in wonktastic ways, so grab key_as_string instead of gibberish number
67+
# but otherwise just grab the key if key_as_string unavailable
68+
key = bucket.key?("key_as_string") ? bucket["key_as_string"] : bucket["key"]
69+
val = bucket["doc_count"]
70+
source = key
71+
# top_matches is a top_hits aggregation which returns a list of terms
72+
# which were used for the facet.
73+
# Example: "Willa Cather" and "WILLA CATHER"
74+
# Those terms will both have been normalized as "willa cather" but
75+
# we will want to display one of the non-normalized terms instead
76+
top_hits = bucket.dig("top_matches", "hits", "hits")
77+
if top_hits
78+
source = find_source_from_top_hits(top_hits, field, key)
79+
end
80+
facets[field][key] = {
81+
"num" => val,
82+
"source" => source
83+
}
84+
end
85+
4186
def reformat_facets
42-
facets = @body.dig(*@@facets)
43-
if facets
44-
formatted = {}
45-
facets.each do |field, info|
46-
formatted[field] = {}
47-
buckets = {}
48-
# nested fields do not have buckets
49-
# at this level in the response structure
50-
if info.has_key?("buckets")
51-
buckets = info["buckets"]
52-
else
53-
buckets = info.dig(field, "buckets")
54-
end
87+
raw_facets = @body.dig(*@@facets)
88+
if raw_facets
89+
facets = {}
90+
raw_facets.each do |field, info|
91+
facets[field] = {}
92+
# nested fields do not have buckets at this level of response structure
93+
buckets = info.key?("buckets") ? info["buckets"] : info.dig(field, "buckets")
5594
if buckets
56-
buckets.each do |b|
57-
# dates return in wonktastic ways, so grab key_as_string instead of gibberish number
58-
# but otherwise just grab the key if key_as_string unavailable
59-
key = b.has_key?("key_as_string") ? b["key_as_string"] : b["key"]
60-
val = b["doc_count"]
61-
formatted[field][key] = val
62-
end
95+
buckets.each { |b| format_bucket_value(facets, field, b) }
6396
else
64-
formatted[field] = {}
97+
facets[field] = {}
6598
end
6699
end
67-
return formatted
100+
facets
68101
else
69-
return {}
102+
{}
70103
end
71104
end
72105

106+
def remove_nonword_chars(term)
107+
# transliterate to ascii (ø -> o)
108+
transliterated = I18n.transliterate(term)
109+
# remove html tags like em, u, and strong, then strip remaining non-word characters (anything but letters, digits, and underscore)
110+
transliterated.gsub(/<\/?(?:em|strong|u)>|\W/, "").downcase
111+
end
112+
73113
end
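
To see how the display label is picked for a multivalued field, the matching logic in this file can be sketched in isolation. This is a simplified approximation, not the committed code: `normalize` omits the `I18n.transliterate` step to avoid a gem dependency, `pick_source` is a hypothetical name, and the sample bucket is invented data shaped like a `top_hits` result.

```ruby
# Simplified stand-in for remove_nonword_chars: the commit also calls
# I18n.transliterate first, which is omitted here (assumption).
def normalize(term)
  term.gsub(/<\/?(?:em|strong|u)>|\W/, "").downcase
end

# Hypothetical helper mirroring the multivalued branch of
# find_source_from_top_hits: pick the value whose normalized form
# matches the bucket key, else fall back to all values joined.
def pick_source(values, key)
  return values unless values.is_a?(Array)

  idx = values.map { |v| normalize(v) }.index(normalize(key))
  idx ? values[idx] : values.join(" ")
end

# Invented sample bucket shaped like a terms bucket with a top_hits sub-agg
bucket = {
  "key" => "willa cather",
  "doc_count" => 60,
  "top_matches" => {
    "hits" => {
      "hits" => [
        { "_source" => { "name" => ["Edith Lewis", "Willa Cather"] } }
      ]
    }
  }
}

values = bucket.dig("top_matches", "hits", "hits").first.dig("_source", "name")
result = { "num" => bucket["doc_count"], "source" => pick_source(values, bucket["key"]) }
p result
```

As the inline comment in the commit notes, this only works when the Ruby-side normalization agrees with whatever normalizer produced the bucket key, so the two need to be kept in step.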

app/services/search_service.rb

Lines changed: 3 additions & 3 deletions
@@ -12,9 +12,9 @@ def initialize(url, params={}, user_req)
1212

1313
def post(url_ending, json)
1414
res = RestClient.post("#{@url}/#{url_ending}", json.to_json, { "content-type" => "json" } )
15-
return JSON.parse(res.body)
15+
JSON.parse(res.body)
1616
rescue => e
17-
return e
17+
e
1818
end
1919

2020
def search_collections
@@ -108,7 +108,7 @@ def on_success(req, res)
108108
if @params["debug"].present?
109109
json["req"]["query_obj"] = req
110110
end
111-
return json
111+
json
112112
end
113113

114114
def build_collections_response(res)
