# Semantic search in business news - a notebook article

Semantic search is revolutionizing how we discover and consume news articles, offering a more intuitive and efficient way to find relevant content and curate personalized news feeds. By embedding the nuances and underlying concepts of text documents in vectors, we can retrieve articles that align closely with the user's interests, preferences, and browsing history.

Still, implementing effective semantic search for news articles presents **challenges**, including:

- **Response optimization**: you need to figure out how to weight data attributes in your semantic search algorithms
- **Scalability and performance**: you need efficient indexing and retrieval mechanisms to handle the vast volume of news articles

Superlinked is designed to handle these challenges, empowering you to **scale efficiently** and - using Superlinked Spaces - **prioritize semantic relevance and/or recency so you can recommend highly relevant news articles to your users *without having to re-embed your dataset*.**

To illustrate, we'll take you step by step through building a **semantic-search-powered business news recommendation app**, using the following parts of Superlinked's library:

- **[Recency space](https://github.com/superlinked/superlinked/blob/main/notebook/feature/recency_embedding.ipynb)** - to encode the recency of a data point
- **[TextSimilarity space](https://github.com/superlinked/superlinked/blob/main/notebook/feature/text_embedding.ipynb)** - to encode the semantic meaning of text data
- **[Query time weights](https://github.com/superlinked/superlinked/blob/main/notebook/feature/query_time_weights.ipynb)** - to prioritize different attributes in your queries, without having to re-embed the whole dataset

Using these spaces to embed our articles' headlines, text, and publication dates, we'll be able to skew our results towards older or more recent news as desired, and also search using specific search terms or a specific news article.

Ready to begin? Let's take a quick look at our dataset, then use Superlinked to embed the articles smartly and handle our queries.

## Our dataset and embeddings

Our dataset of [news articles](https://www.kaggle.com/datasets/rmisra/news-category-dataset) is filtered for news in the 'BUSINESS' category.

We'll embed:

- headlines
- article text (short descriptions)
- publication (release) dates

## Setup

First, we **install Superlinked**.

```python
%pip install superlinked==12.19.1
```

Now we **import all our dependencies**...

```python
from datetime import datetime, timedelta, timezone

import os
import sys
import altair as alt
import pandas as pd

from superlinked.evaluation.charts.recency_plotter import RecencyPlotter
from superlinked.framework.common.dag.context import CONTEXT_COMMON, CONTEXT_COMMON_NOW
from superlinked.framework.common.dag.period_time import PeriodTime
from superlinked.framework.common.schema.schema import Schema
from superlinked.framework.common.schema.schema_object import String, Timestamp
from superlinked.framework.common.schema.id_schema_object import IdField
from superlinked.framework.common.parser.dataframe_parser import DataFrameParser
from superlinked.framework.common.util.interactive_util import get_altair_renderer
from superlinked.framework.dsl.executor.in_memory.in_memory_executor import (
    InMemoryExecutor,
    InMemoryApp,
)
from superlinked.framework.dsl.index.index import Index
from superlinked.framework.dsl.query.param import Param
from superlinked.framework.dsl.query.query import Query
from superlinked.framework.dsl.query.result import Result
from superlinked.framework.dsl.source.in_memory_source import InMemorySource
from superlinked.framework.dsl.space.text_similarity_space import TextSimilaritySpace
from superlinked.framework.dsl.space.recency_space import RecencySpace

alt.renderers.enable(get_altair_renderer())
alt.data_transformers.disable_max_rows()
pd.set_option("display.max_colwidth", 190)
```

...and **declare our constants**.

```python
YEAR_IN_DAYS = 365
TOP_N = 10
DATASET_URL = "https://storage.googleapis.com/superlinked-notebook-news-dataset/business_news.json"
# as the dataset contains articles from 2022 and before, we can set our application's "NOW" to this date
END_OF_2022_TS = int(datetime(2022, 12, 31, 23, 59).timestamp())
EXECUTOR_DATA = {CONTEXT_COMMON: {CONTEXT_COMMON_NOW: END_OF_2022_TS}}
```

## Prepare & explore dataset

Let's read our data...

```python
NROWS = int(os.getenv("NOTEBOOK_TEST_ROW_LIMIT", str(sys.maxsize)))
business_news = pd.read_json(DATASET_URL, convert_dates=True).head(NROWS)
```

...then turn the current index into a column ("id"), and convert the date column timestamps into UTC.

```python
# we are going to need an id column
business_news = business_news.reset_index().rename(columns={"index": "id"})
# convert the date timestamp into utc timezone
business_news["date"] = [
    int(date.replace(tzinfo=timezone.utc).timestamp()) for date in business_news.date
]
```

Let's take a sneak peek.

```python
num_rows = business_news.shape[0]
print(f"Our dataset contains {num_rows} articles.")
business_news.head()
```

Our dataset has 5992 articles. Here are the first 5.

![sneak peek into data](../assets/use_cases/semantic_search_news/sneak_peek.png)
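In case the image is hard to read, the columns we'll rely on downstream are `id`, `headline`, `short_description`, and `date` (now a UTC timestamp in seconds). A quick way to confirm this, assuming the standard Kaggle column names:

```python
# confirm the columns we'll map into our Superlinked schema below
print(business_news[["id", "headline", "short_description", "date"]].dtypes)
```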

### Understand release date distribution

So that we can set some recency time periods, we'll take a closer look at how our articles are distributed over time.

```python
# some quick transformations and an altair histogram
years_to_plot: pd.DataFrame = pd.DataFrame(
    {
        "year_of_publication": [
            int(datetime.fromtimestamp(ts).year) for ts in business_news["date"]
        ]
    }
)
alt.Chart(years_to_plot).mark_bar().encode(
    alt.X("year_of_publication:N", bin=True, title="Year of publication"),
    y=alt.Y("count()", title="Count of articles"),
).properties(width=400, height=400)
```

![count of articles by year of publication](../assets/use_cases/semantic_search_news/count_article-by-year_publication.png)

Because our oldest article was published in 2012 and we want to be able to query all our dataset articles, we should set our longer time period to around 11 years, so that it covers the whole dataset.

The vast majority of our articles were published from 2012 through 2017, so it makes sense to differentiate that dense period from a more recent 4-year period (2018-2022) in which the article count is much lower.

By giving the publication-dense period (2012-2017) additional weight, we make sure our retrieval appropriately represents the small recency differences between its articles. This way, the differences in our publication-scarce period (2018-2022), which will be larger than in the dense period, aren't overrepresented.
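As a quick sanity check on this split, we can count articles per period using the `business_news` dataframe from above:

```python
# count articles in the dense (2012-2017) and scarce (2018-2022) periods
years = pd.Series(
    [datetime.fromtimestamp(ts, tz=timezone.utc).year for ts in business_news["date"]]
)
print("2012-2017 (publication-dense): ", years.between(2012, 2017).sum())
print("2018-2022 (publication-scarce):", years.between(2018, 2022).sum())
```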

Now let's set up Superlinked so we can efficiently optimize our retrieval.

## Set up Superlinked

First, we'll define a schema for our news articles.

```python
# set up schema to accommodate our inputs
class NewsSchema(Schema):
    description: String
    headline: String
    release_timestamp: Timestamp
    id: IdField
```

```python
news = NewsSchema()
```

Next, to embed the characteristics of our text, we use a sentence-transformers model to create a `description_space` for news article descriptions and a `headline_space` for our headlines. Finally, we encode each article's release date using a `recency_space`.

```python
# textual characteristics are embedded using a sentence-transformers model
description_space = TextSimilaritySpace(
    text=news.description, model="sentence-transformers/all-mpnet-base-v2"
)
headline_space = TextSimilaritySpace(
    text=news.headline, model="sentence-transformers/all-mpnet-base-v2"
)
# release date is encoded using our recency embedding algorithm
recency_space = RecencySpace(
    timestamp=news.release_timestamp,
    period_time_list=[
        PeriodTime(timedelta(days=4 * YEAR_IN_DAYS), weight=1),
        PeriodTime(timedelta(days=11 * YEAR_IN_DAYS), weight=2),
    ],
    negative_filter=0.0,
)
```

To query our data, we'll need to create an index of our spaces...

```python
news_index = Index(spaces=[description_space, headline_space, recency_space])
```

...and set up a **simple query** and a **news query**.

**Simple query** lets us use a search term to retrieve from both the headline and the description. Simple query also gives us the option to weight certain inputs' importance.

```python
simple_query = (
    Query(
        news_index,
        weights={
            description_space: Param("description_weight"),
            headline_space: Param("headline_weight"),
            recency_space: Param("recency_weight"),
        },
    )
    .find(news)
    .similar(description_space.text, Param("query_text"))
    .similar(headline_space.text, Param("query_text"))
    .limit(Param("limit"))
)
```

**News query** will search our database using the vector for a specific news article. News query, like simple query, can be weighted.

```python
news_query = (
    Query(
        news_index,
        weights={
            description_space: Param("description_weight"),
            headline_space: Param("headline_weight"),
            recency_space: Param("recency_weight"),
        },
    )
    .find(news)
    .with_vector(news, Param("news_id"))
    .limit(Param("limit"))
)
```

Next, we parse our dataframe...

```python
dataframe_parser = DataFrameParser(
    schema=news,
    mapping={news.release_timestamp: "date", news.description: "short_description"},
)
```

...create an InMemorySource object to accept the data (which is stored in an InMemoryVectorDatabase), and set up our executor (with our article dataset and index) so that it takes account of context data. The executor creates vectors based on the index's grouping of Spaces.

```python
source: InMemorySource = InMemorySource(news, parser=dataframe_parser)
executor: InMemoryExecutor = InMemoryExecutor(
    sources=[source], indices=[news_index], context_data=EXECUTOR_DATA
)
app: InMemoryApp = executor.run()
```

It's time to **input our business news data**.

```python
source.put([business_news])
```

### Understanding recency

With our business news data ingested, let's plot our recency scores.

```python
recency_plotter = RecencyPlotter(recency_space, context_data=EXECUTOR_DATA)
recency_plotter.plot_recency_curve()
```

![recency scores for our time periods](../assets/use_cases/semantic_search_news/recency_scores.png)
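Superlinked's actual recency embedding is described in the [recency space notebook](https://github.com/superlinked/superlinked/blob/main/notebook/feature/recency_embedding.ipynb). Purely as intuition for the curve above, here's a simplified toy version of the idea - not the library's algorithm:

```python
# Toy illustration ONLY - not Superlinked's actual recency embedding.
# Intuition: each period contributes a score that decays with the item's age
# and is scaled by the period's weight; items older than every period fall
# back to negative_filter (0.0 in our setup).
def toy_recency_score(release_ts: int, now_ts: int = END_OF_2022_TS) -> float:
    age_days = (now_ts - release_ts) / 86400
    score = 0.0
    for period_days, weight in [(4 * YEAR_IN_DAYS, 1.0), (11 * YEAR_IN_DAYS, 2.0)]:
        if age_days <= period_days:
            score += weight * (1 - age_days / period_days)
    return score
```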

## Queries

To see our query results when we run them, we'll set up a helper to present them in the notebook.

```python
def present_result(
    result_to_present: Result,
    cols_to_keep: list[str] | None = None,
) -> pd.DataFrame:
    if cols_to_keep is None:
        cols_to_keep = [
            "description",
            "headline",
            "release_date",
            "id",
            "similarity_score",
        ]
    # parse result to dataframe
    df: pd.DataFrame = result_to_present.to_pandas()
    # transform timestamp back to a calendar release date
    df["release_date"] = [
        datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
        for timestamp in df["release_timestamp"]
    ]
    return df[cols_to_keep]
```

Now, say we want to read articles about Microsoft acquiring LinkedIn - one of the biggest acquisitions of the last decade. We input our query text as follows, weighting headline and description at 1. Recency doesn't matter yet, so we set its weight to 0.

```python
result = app.query(
    simple_query,
    query_text="Microsoft acquires LinkedIn",
    description_weight=1,
    headline_weight=1,
    recency_weight=0,
    limit=TOP_N,
)

present_result(result)
```

Let's take a look at our results.

![microsoft acquires linkedin](../assets/use_cases/semantic_search_news/microsoft_acquires_linkedin.png)

The first result is about the deal itself. The other results relate to some aspect of the query. Let's try upweighting recency to see if we can surface other, more recent big acquisitions.

```python
result = app.query(
    simple_query,
    query_text="Microsoft acquires LinkedIn",
    description_weight=1,
    headline_weight=1,
    recency_weight=1,
    limit=TOP_N,
)

present_result(result)
```

![microsoft linkedin recency upweighted](../assets/use_cases/semantic_search_news/microsoft_linkedin_recency_upweighted.png)

With recency upweighted, our second result is an article about the much more recent Elon Musk Twitter offer.

Now let's take this article and perform a search with its `news_id`, resetting recency to 0.
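Here, `"849"` is the id of the Musk/Twitter article. Rather than reading it off the screenshot, you could also pull it from the previous result programmatically - for example (assuming, as in our run, that it's the second row):

```python
# grab the id of the Musk/Twitter article (second row of the last result)
previous_hits = present_result(result)
musk_twitter_id = str(previous_hits.iloc[1]["id"])  # "849" in our run
```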

```python
result = app.query(
    news_query,
    description_weight=1,
    headline_weight=1,
    recency_weight=0,
    news_id="849",
    limit=TOP_N,
)

present_result(result)
```

![musk acquires twitter](../assets/use_cases/semantic_search_news/musk_twitter.png)

Because our dataset is significantly biased towards older articles, and recency is set to 0, our query retrieves articles highly relevant to the content of our search article - focused on either Elon Musk or Twitter.

To get more recent articles into the mix, we can start biasing toward recency, navigating the tradeoff between recency and text similarity.

```python
result = app.query(
    news_query,
    description_weight=1,
    headline_weight=1,
    recency_weight=1,
    news_id="849",
    limit=TOP_N,
)

present_result(result)
```

![musk twitter recency](../assets/use_cases/semantic_search_news/musk_twitter_recency.png)

## In sum

Whatever your semantic search use case, Superlinked Spaces enables you to optimize your vector retrieval with a high degree of control, without incurring the time and resource costs of re-embedding your dataset. By embedding smartly (attribute by attribute) with our Recency and TextSimilarity spaces, you can prioritize or deprioritize different attributes as needed at query time.

Now it's your turn! Try your own simple_query and news_query in the [notebook](https://github.com/superlinked/superlinked/blob/main/notebook/semantic_search_news.ipynb). Alter the `description_weight`, `headline_weight`, and `recency_weight` on your own `query_text` and `news_id` and observe the changes in your results!
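For instance, a headline-only variation with a mild recency bias might look like this (the query text and weights here are just illustrative):

```python
# hypothetical example: search headlines only, nudging results toward recent news
result = app.query(
    simple_query,
    query_text="tech industry layoffs",
    description_weight=0,
    headline_weight=1,
    recency_weight=0.5,
    limit=TOP_N,
)

present_result(result)
```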