Commit e5ea7c7

Revise use cases with transformers (#507)
Add new use cases:
- Text-to-image
- Speech Recognition
- Text Generation
- (Image Segmentation now refers to [SegAny] also)

Add new reference models:
- Text-to-image: stable-diffusion-v1-5
- Image segmentation: segment-anything
- Speech-to-text: whisper-tiny.en
- Text generation: t5-small, m2m100_418M, gpt2, llama-2-7b

Remove redundant local reference: [POWERFUL-FEATURES]
1 parent 7773794 commit e5ea7c7

1 file changed: +209 -7 lines changed

index.bs

Lines changed: 209 additions & 7 deletions
@@ -437,8 +437,8 @@ A user joins a teleconference via a web-based video conferencing application at
 her desk since no meeting room in her office is available. During the
 teleconference, she does not wish that her room and people in the background are
 visible. To protect the privacy of the other people and the surroundings, the
-application runs a machine learning model such as [[DeepLabv3+]] or
-[[MaskR-CNN]] to semantically split an image into segments and replaces
+application runs a machine learning model such as [[DeepLabv3+]], [[MaskR-CNN]]
+or [[SegAny]] to semantically split an image into segments and replaces
 segments that represent other people and background with another picture.
 
 ### Skeleton Detection ### {#usecase-skeleton-detection}
@@ -490,6 +490,20 @@ For better accessibility, a web-based presentation application provides
 automatic image captioning by running a machine learning model such as
 [[im2txt]] which predicts explanatory words of the presentation slides.
 
+### Text-to-image ### {#usecase-text-to-image}
+
+Images are a core part of modern web experiences. An ability to generate images
+based on text input in a privacy-preserving manner enables visual
+personalization and adaptation of web applications and content. For example, a web
+application can use as an input a natural language description on the web page
+or a description provided by the user within a text prompt to produce an
+image matching the text description. This text-to-image use case enabled by
+latent diffusion model architecture [[LDM]] forms the basis for additional
+text-to-image use cases. For example, inpainting where a portion of an existing
+image on the web page is selectively modified using the newly generated content,
+or the converse, outpainting, where an original image is extended beyond its
+original dimensions filling the empty space with generated content.
+
 ### Machine Translation ### {#usecase-translation}
 
 Multiple people from various countries are talking via a web-based real-time
@@ -520,6 +534,29 @@ noise suppression using Recurrent Neural Network such as [[RNNoise]] for
 suppressing background dynamic noise like baby cry or dog barking to improve
 audio experiences in video conferences.
 
+### Speech Recognition ### {#usecase-speech-recognition}
+
+Speech recognition, also known as speech to text, enables recognition and
+translation of spoken language into text. Example applications of speech
+recognition include transcription, automatic translation, multimodal interaction,
+real-time captioning and virtual assistants. Speech recognition improves
+accessibility of auditory content and makes it possible to interact with such
+content in a privacy-preserving manner in a textual form. Examples of common
+use cases include watching videos or participating in online meetings using
+real-time captioning. Models such as [[Whisper]] approach humans in their accuracy
+and robustness and are well positioned to improve accessibility of such use cases.
+
+### Text Generation ### {#usecase-text-generation}
+
+Various text generation use cases are enabled by large language models (LLM) that
+are able to perform tasks where a general ability to predict the next item
+in a text sequence is required. This class of models can translate texts, answer
+questions based on a text input, summarize a larger body of text, or generate
+text output based on a textual input. LLMs enable better performance compared to
+older models based on RNN, CNN, or LSTM architectures and further improve the
+performance of many other use cases discussed in this section.
+Examples of LLMs include [[t5-small]], [[m2m100_418M]], [[gpt2]], and [[llama-2-7b]].
+
 ### Detecting fake video ### {#usecase-detecting-fake-video}
 
 A user is exposed to realistic fake videos generated by ‘deepfake’ on the web.
@@ -6524,6 +6561,25 @@ Thanks to Dwayne Robinson for his work investigating and providing recommendatio
 ],
 "date": "January 2018"
 },
+"SegAny": {
+"href": "https://arxiv.org/abs/2304.02643",
+"title": "Segment Anything",
+"authors": [
+"Alexander Kirillov",
+"Alex Berg",
+"Chloe Rolland",
+"Eric Mintun",
+"Hanzi Mao",
+"Laura Gustafson",
+"Nikhila Ravi",
+"Piotr Dollar",
+"Ross Girshick",
+"Spencer Whitehead",
+"Wan-Yen Lo",
+"Tete Xiao"
+],
+"date": "April 2023"
+},
 "PoseNet": {
 "href": "https://medium.com/tensorflow/real-time-human-pose-estimation-in-the-browser-with-tensorflow-js-7dd0bc881cd5",
 "title": "Real-time Human Pose Estimation in the Browser with TensorFlow.js",
@@ -6601,6 +6657,18 @@ Thanks to Dwayne Robinson for his work investigating and providing recommendatio
 ],
 "date": "September 2016"
 },
+"LDM": {
+"href": "https://arxiv.org/abs/2112.10752",
+"title": "High-Resolution Image Synthesis with Latent Diffusion Models",
+"authors": [
+"Robin Rombach",
+"Andreas Blattmann",
+"Dominik Lorenz",
+"Patrick Esser",
+"Björn Ommer"
+],
+"date": "April 2022"
+},
 "GNMT": {
 "href": "https://github.com/tensorflow/nmt",
 "title": "Neural Machine Translation (seq2seq) Tutorial",
@@ -6674,6 +6742,19 @@ Thanks to Dwayne Robinson for his work investigating and providing recommendatio
 ],
 "date": "September 2017"
 },
+"Whisper": {
+"href": "https://arxiv.org/abs/2212.04356",
+"title": "Robust Speech Recognition via Large-Scale Weak Supervision",
+"authors": [
+"Alec Radford",
+"Jong Wook Kim",
+"Tao Xu",
+"Greg Brockman",
+"Christine McLeavey",
+"Ilya Sutskever"
+],
+"date": "December 2022"
+},
 "GRU": {
 "href": "https://arxiv.org/pdf/1406.1078.pdf",
 "title": "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation",
@@ -6766,12 +6847,133 @@ Thanks to Dwayne Robinson for his work investigating and providing recommendatio
 ],
 "date": "November 2019"
 },
-"POWERFUL-FEATURES": {
-"href": "https://w3c.github.io/webappsec-secure-contexts/",
-"title": "Secure Contexts",
+"t5-small": {
+"href": "https://jmlr.org/papers/volume21/20-074/20-074.pdf",
+"title": "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer",
+"authors": [
+"Colin Raffel",
+"Noam Shazeer",
+"Adam Roberts",
+"Katherine Lee",
+"Sharan Narang",
+"Michael Matena",
+"Yanqi Zhou",
+"Wei Li",
+"Peter J. Liu"
+],
+"date": "June 2020"
+},
+"m2m100_418M": {
+"href": "https://arxiv.org/abs/2010.11125",
+"title": "Beyond English-Centric Multilingual Machine Translation",
 "authors": [
-"Mike West"
-]
+"Angela Fan",
+"Shruti Bhosale",
+"Holger Schwenk",
+"Zhiyi Ma",
+"Ahmed El-Kishky",
+"Siddharth Goyal",
+"Mandeep Baines",
+"Onur Celebi",
+"Guillaume Wenzek",
+"Vishrav Chaudhary",
+"Naman Goyal",
+"Tom Birch",
+"Vitaliy Liptchinsky",
+"Sergey Edunov",
+"Edouard Grave",
+"Michael Auli",
+"Armand Joulin"
+],
+"date": "October 2020"
+},
+"gpt2": {
+"href": "https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf",
+"title": "Language Models are Unsupervised Multitask Learners",
+"authors": [
+"Alec Radford",
+"Jeffrey Wu",
+"Rewon Child",
+"David Luan",
+"Dario Amodei",
+"Ilya Sutskever"
+],
+"date": "February 2019"
+},
+"llama-2-7b": {
+"href": "https://arxiv.org/abs/2307.09288",
+"title": "Llama 2: Open Foundation and Fine-Tuned Chat Models",
+"authors": [
+"Hugo Touvron",
+"Louis Martin",
+"Kevin Stone",
+"Peter Albert",
+"Amjad Almahairi",
+"Yasmine Babaei",
+"Nikolay Bashlykov",
+"Soumya Batra",
+"Prajjwal Bhargava",
+"Shruti Bhosale",
+"Dan Bikel",
+"Lukas Blecher",
+"Cristian Canton Ferrer",
+"Moya Chen",
+"Guillem Cucurull",
+"David Esiobu",
+"Jude Fernandes",
+"Jeremy Fu",
+"Wenyin Fu",
+"Brian Fuller",
+"Cynthia Gao",
+"Vedanuj Goswami",
+"Naman Goyal",
+"Anthony Hartshorn",
+"Saghar Hosseini",
+"Rui Hou",
+"Hakan Inan",
+"Marcin Kardas",
+"Viktor Kerkez",
+"Madian Khabsa",
+"Isabel Kloumann",
+"Artem Korenev",
+"Punit Singh Koura",
+"Marie-Anne Lachaux",
+"Thibaut Lavril",
+"Jenya Lee",
+"Diana Liskovich",
+"Yinghai Lu",
+"Yuning Mao",
+"Xavier Martinet",
+"Todor Mihaylov",
+"Pushkar Mishra",
+"Igor Molybog",
+"Yixin Nie",
+"Andrew Poulton",
+"Jeremy Reizenstein",
+"Rashi Rungta",
+"Kalyan Saladi",
+"Alan Schelten",
+"Ruan Silva",
+"Eric Michael Smith",
+"Ranjan Subramanian",
+"Xiaoqing Ellen Tan",
+"Binh Tang",
+"Ross Taylor",
+"Adina Williams",
+"Jian Xiang Kuan",
+"Puxin Xu",
+"Zheng Yan",
+"Iliyan Zarov",
+"Yuchen Zhang",
+"Angela Fan",
+"Melanie Kambadur",
+"Sharan Narang",
+"Aurelien Rodriguez",
+"Robert Stojnic",
+"Sergey Edunov",
+"Thomas Scialom"
+],
+"date": "July 2023"
 }
 }
 </pre>
