| 1 | +--- |
| 2 | +title: "Extracting Duolingo vocabulary to Quizlet" |
| 3 | +date: 2020-07-05 14:06:00 +0200 |
| 4 | +tags: duolingo quizlet javascript python |
| 5 | +image: /assets/images/stories/2020-07-05.jpg |
| 6 | +language: |
| 7 | +  name: Python |
| 8 | +--- |
| 9 | + |
| 10 | +You probably arrived here because you also want to extract the vocabulary for a specific language from Duolingo and use it elsewhere. In my case, I imported it into Quizlet. |
| 11 | +<!--more--> |
| 12 | +<table class="table table-condensed"> |
| 13 | + <tr> |
| 14 | + <td><i class="mdi mdi-github-circle"></i></td> |
| 15 | + <td><a target="_blank" href="https://github.com/melledijkstra/duolingo-vocab-extractor">github.com/melledijkstra/duolingo-vocab-extractor</a></td> |
| 16 | + </tr> |
| 17 | +</table> |
| 18 | + |
| 19 | +I do this because, in my experience, Quizlet currently handles spaced repetition a bit better than Duolingo. I also like having different ways of learning a language. This time I wanted to extract the Swahili vocabulary from Duolingo. |
| 20 | + |
| 21 | +There are simple Chrome extensions you can use to extract data from a webpage. While that works, I also wanted the translations of the terms and the example sentences that Duolingo provides. That was a bit harder because the data lives in different places. It is also quite easy to extract the vocabulary of a single user, but the complete vocabulary of a language is harder to get, and I haven't found an overview that shows the definitions. |
| 22 | + |
| 23 | +_My first approach was to use the `scrapy` Python package to extract the data, but that quickly turned out to be overkill._ |
| 24 | + |
| 25 | +First things first: I created a Python virtual environment for all my testing. The first task is to get all the vocabulary words (terms). There are several places to retrieve the term list from. For example, https://duome.eu/MelleDijkstra/progress shows a lot of information that is not displayed on the official Duolingo website, and it isn't too hard to extract it from there. |
| 26 | + |
| 27 | +In my case, I chose to extract the terms from the `skills` tab, where they are grouped by the exercise they belong to. With some simple JavaScript run on that page, I got the following result: |
| 28 | + |
| 29 | +```javascript |
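| 30 | +const dictionary = {}; // exercise name -> list of terms |
| 31 | +$('li.shift').each(function(i, elem) { |
| 32 | +  const exercise = $(elem).find('span.sTI').text(); |
| 33 | +  const wordbox = $(elem).find('.blue').clone(); // work on a copy so the page itself is not modified |
| 34 | +  wordbox.find('small').remove(); // remove the <small> elements so only the terms remain in the text |
| 35 | +  const terms = wordbox.text().split(' · '); |
| 36 | +  dictionary[exercise] = terms; |
| 37 | +}); |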
| 38 | + |
| 39 | +JSON.stringify(dictionary); |
| 40 | +``` |
| 41 | + |
| 42 | +```json |
| 43 | +{ |
| 44 | + "Introduction": |
| 45 | +["emilian","jina","juma","la","lake","lako","langu","mchina","mholanzi","mimi","mkenya","mmarekani","mtanzania","nani","ni","ninyi","rehema","sisi","wachina","waholanzi","wakenya","wamarekani","wao","watanzania","wewe","yeye"], |
| 46 | + "Greetings 1": |
| 47 | +["asubuhi","baba","babu","bibi","dada","habari","hajambo","hamjambo","hatujambo", ... |
| 48 | + ... |
| 49 | + } |
| 50 | +``` |
| 51 | + |
| 52 | +This gives exactly the vocabulary needed to look up the translations. I saved the Swahili JSON to a file to continue in Python. |
| 53 | + |
| 54 | +> Note: you can click the tab headings before running the script to get the order you prefer. |
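
Before moving on, it can help to sanity-check the saved file. The following is only a minimal sketch: `load_dictionary` and `unique_terms` are hypothetical helpers that are not part of the repository, and the sample data is a hand-made stand-in in the same shape as the extracted JSON. It assumes a term may show up under more than one exercise, which I have not verified.

```python
import json


def load_dictionary(path):
    """Load the exercise -> terms mapping saved from the browser."""
    with open(path, 'r', encoding='utf-8') as fp:
        return json.load(fp)


def unique_terms(dictionary):
    """Return every term once, in first-seen order, across all exercises."""
    seen = set()
    ordered = []
    for terms in dictionary.values():
        for term in terms:
            if term not in seen:
                seen.add(term)
                ordered.append(term)
    return ordered


# Hand-made sample in the shape of the extracted JSON.
# The repeated 'la' is made up purely to show the deduplication.
sample = {
    'Introduction': ['jina', 'la', 'mimi'],
    'Greetings 1': ['baba', 'habari', 'la'],
}
print(unique_terms(sample))
```

With the real file you would call `unique_terms(load_dictionary('duolingo-swahili-dictionary.json'))`, using the file name the script further down reads.
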
| 55 | + |
| 56 | + |
| 57 | + |
| 58 | +Nice! One step done. Now that I have all the words in the vocabulary for a particular language, I can start retrieving the translations and the example sentences that Duolingo provides. These are the same example sentences and words that you get during the exercises in the mobile application and on the web version. |
| 59 | + |
| 60 | +I needed an entry point for the translations. Duolingo has a dictionary for most languages, which is a good starting point. I went to https://www.duolingo.com/dictionary/Swahili/baba, using the word 'baba' as an example. I first wrote a Python script that requests this dictionary page, intending to extract the translation info from the HTML. Unfortunately, that didn't work as expected: the page loads its content dynamically with JavaScript, and a plain request made with Python's `requests` does not execute that JavaScript. Instead, I looked for the source from which the JavaScript loads the translation data. |
| 61 | + |
| 62 | +The network tab of the browser's developer tools shows where the data comes from: the JavaScript uses an unofficial API to retrieve it. From Python this API is even easier to work with, because it returns JSON. |
| 63 | + |
| 64 | + |
| 65 | + |
| 66 | +Actually, the data on the webpage comes from two different endpoints (APIs). The first is for the dictionary search: it tries to match the search input to a term or word known in the Duolingo database. If there is a match, it returns a list of possible matches with their `lexeme_id`. The second endpoint needs this `lexeme_id` to return the information for that particular lexeme. That information is very helpful: it contains the example phrases, translations, text-to-speech audio (URL) and more info about the lexeme. |
| 67 | + |
| 68 | +With this information I knew which data to retrieve and where to find it. I created the Python script below, which does all the necessary work and stores the information in a JSON file. That file is then ready for whatever further steps you want to take with the data. In my case, that means importing it into Quizlet. |
| 69 | + |
| 70 | +```python |
| 71 | +import requests as req |
| 72 | +import json |
| 73 | + |
| 74 | +def get_lexem_data(lexem_id=None, from_language_id='en'): |
| 75 | +    def __get_phrases(alternative_forms): |
| 76 | +        phrases = [] |
| 77 | +        for form in alternative_forms: |
| 78 | +            phrases.append({ |
| 79 | +                'text': form['text'], |
| 80 | +                'tts': form['tts'], |
| 81 | +                'translation_text': form['translation_text'], |
| 82 | +            }) |
| 83 | +        return phrases |
| 84 | + |
| 85 | +    data = req.get('https://www.duolingo.com/api/1/dictionary_page', { |
| 86 | +        'lexeme_id': lexem_id, |
| 87 | +        'from_language_id': from_language_id, |
| 88 | +    }).json() |
| 89 | + |
| 90 | +    return { |
| 91 | +        'word': data['word'], |
| 92 | +        'lexem_id': data['lexeme_id'], |
| 93 | +        'image': data['lexeme_image'], |
| 94 | +        'tts': data['tts'], |
| 95 | +        'phrases': __get_phrases(data['alternative_forms']), |
| 96 | +    } |
| 97 | + |
| 98 | + |
| 99 | +def get_translations(term, from_language_id='en'): |
| 100 | +    data = req.get('https://duolingo-lexicon-prod.duolingo.com/api/1/search', { |
| 101 | +        'query': term, |
| 102 | +        'exactness': 1, |
| 103 | +        'languageId': 'sw',  # Swahili |
| 104 | +        'uiLanguageId': from_language_id |
| 105 | +    }).json() |
| 106 | + |
| 107 | +    for result in data['results']: |
| 108 | +        if result['exactMatch'] is True:  # we only allow exact matches |
| 109 | +            return result['lexemeId'], result['translations'][from_language_id] |
| 110 | +    return '', []  # no exact match is found |
| 111 | + |
| 112 | + |
| 113 | +def get_everything(term, from_language_id='en'): |
| 114 | +    lexem_id, translations = get_translations(term, from_language_id) |
| 115 | +    try: |
| 116 | +        lexem_data = get_lexem_data(lexem_id) |
| 117 | +    except Exception:  # request failed or the response lacked the expected fields |
| 118 | +        lexem_data = {} |
| 119 | +    info = { |
| 120 | +        'lexem_id': lexem_id, |
| 121 | +        'translations': translations, |
| 122 | +        **lexem_data, |
| 123 | +    } |
| 124 | +    return info |
| 125 | + |
| 126 | + |
| 127 | +if __name__ == '__main__': |
| 128 | +    # here the dictionary gets loaded that was extracted previously |
| 129 | +    with open('duolingo-swahili-dictionary.json', 'r') as fp: |
| 130 | +        dictionary = json.load(fp)  # type: dict |
| 131 | + |
| 132 | +    lexicon = {} |
| 133 | + |
| 134 | +    i = 0 |
| 135 | +    n = len(dictionary) |
| 136 | +    for exercise, terms in dictionary.items(): |
| 137 | +        print(f'{i / n * 100:.0f}%\t| Exercise: {exercise} - Terms: {len(terms)} terms\n\t', end='') |
| 138 | +        term_datas = [] |
| 139 | +        for term in terms: |
| 140 | +            print(f'{term} • ', end='') |
| 141 | +            term_datas.append(get_everything(term)) |
| 142 | +        lexicon[exercise] = term_datas |
| 143 | +        print() |
| 144 | +        i += 1 |
| 145 | + |
| 146 | +    with open('duolingo-swahili-lexicon.json', 'w') as fp: |
| 147 | +        json.dump(lexicon, fp) |
| 148 | +``` |
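
One thing to watch for: when `get_translations` finds no exact match, the entry that `get_everything` stores has an empty `lexem_id` and, assuming the second request then fails for the empty ID, no `word` either. The output file therefore does not say which term was missed. Because the script appends results in the same order as the terms, the two files can be paired by position to recover them. This is only a sketch: `find_unmatched` is a hypothetical helper that is not part of the repository, and the sample values are placeholders.

```python
def find_unmatched(dictionary, lexicon):
    """Map each exercise to the terms whose lookup had no exact match."""
    unmatched = {}
    for exercise, terms in dictionary.items():
        entries = lexicon.get(exercise, [])
        # Results were appended in term order, so zip pairs each term with its entry.
        missed = [term for term, entry in zip(terms, entries)
                  if not entry.get('lexem_id')]
        if missed:
            unmatched[exercise] = missed
    return unmatched


# Illustrative data in the shape the script reads and writes (placeholder values).
dictionary = {'Food': ['chai', 'nyama ya']}
lexicon = {'Food': [
    {'lexem_id': 'placeholder-id', 'translations': ['tea'], 'word': 'chai'},
    {'lexem_id': '', 'translations': []},  # shape stored when nothing matched exactly
]}
print(find_unmatched(dictionary, lexicon))
```
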
| 149 | + |
| 150 | +<div class="embed-responsive margin-tb-20 embed-responsive-16by9"> |
| 151 | + <video controls="controls"> |
| 152 | + <source src="/assets/images/story-images/video-duolingo-extract.mp4" type="video/mp4" /> |
| 153 | + Your browser does not support the video tag. |
| 154 | + </video> |
| 155 | +</div> |
| 156 | + |
| 157 | + |
| 158 | +``` |
| 159 | +0% | Exercise: M/Mi Nouns - Terms: 30 terms |
| 160 | + mche • mchezo • mchoro • mchuzi • mfereji • mfuko • miavuli • mikuki • mipira • misikiti • misitu • misumari • miti • mizizi • mji • mkasi • mkeka • mkutano • mlango • mmea • mnyororo • mpaka • mradi • mti • mto • mtumbwi • muziki • mzigo • wa • y • |
| 161 | +2% | Exercise: Food - Terms: 34 terms |
| 162 | + omba • bia • chai • chakula • chipsi • juisi • kahawa • karoti • kila • kuku • machungwa • maembe • maharage • maji • maparachichi • matunda • mayai • maziwa • mboga • mbuzi • mkate • ndizi • ng'ombe • nguruwe • nyama ya • nyanya • pilipili hoho • samaki • ugali • viazi • vinywaji • vitunguu • vitunguu saumu • wali • |
| 163 | +3% | Exercise: Chores - Terms: 22 terms |
| 164 | + fagia • washa • bafu • chumba • chupa • dirisha • faridi • fua • jamila • jiko • kikombe • meza • moto • ndoo • nguo • sabuni • safisha • sahani • sufuria • takataka • ufagio • vyombo • |
| 165 | +5% | Exercise: Present Tense 1 - Terms: 11 terms |
| 166 | + amka • cheza • penda • pika • soma • tembea • ana • lala • nina • tena • una |
| 167 | +........ |
| 168 | +``` |
| 169 | + |
| 170 | +To see exactly how I imported the translations into Quizlet, please check out the Jupyter notebook I created inside the repository: [converter.ipynb](https://github.com/melledijkstra/duolingo-vocab-extractor/blob/master/converter.ipynb). I basically separated each term from its translation with an `@` character (which does not occur in the vocabulary) and then copied and pasted the result into Quizlet. |
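
The notebook has the actual conversion. As a rough sketch of the idea only: `to_quizlet_lines` is a hypothetical helper, it assumes each entry's `translations` is a list of strings, and joining several translations with `, ` is my choice here and not necessarily what the notebook does.

```python
def to_quizlet_lines(entries, separator='@'):
    """Build one 'term@translation' line per entry that has a word and translations."""
    lines = []
    for entry in entries:
        word = entry.get('word')
        translations = entry.get('translations') or []
        if not word or not translations:
            continue  # skip terms for which no exact match was found
        lines.append(f"{word}{separator}{', '.join(translations)}")
    return lines


# Illustrative entries in the shape the extraction script stores per exercise.
entries = [
    {'word': 'baba', 'translations': ['father', 'dad']},
    {'lexem_id': '', 'translations': []},
]
print('\n'.join(to_quizlet_lines(entries)))
```

The printed lines can then be pasted into Quizlet's import box with `@` set as the custom separator between term and definition.
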
| 171 | + |
| 172 | +All the sets for Swahili can be found here: [https://quizlet.com/MelleDijkstra/folders/swahili-duolingo/sets](https://quizlet.com/MelleDijkstra/folders/swahili-duolingo/sets). For most of the sets I added images to the terms to help you remember the words even better. |
| 173 | + |
| 174 | +Happy learning! 😄 |