Exercise 1: Web Crawling for Archival Data

  • Scenario: you need to download scanned documents from an archive website. There are too many PDF files to download one by one, so you want to use a Python script to download them automatically.

  • Executing the code

    1. Go to python/ex1/notebook/
    2. Type jupyter notebook and hit enter.
    3. The code is already there. Execute block by block using Shift + Enter
    4. The output files will be saved in python/ex1/download/
    5. Open the downloaded PDF files in a PDF viewer to check them.
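
For orientation, the kind of download loop the notebook runs can be sketched as follows. This is a minimal sketch, not the notebook's actual code: the URL pattern `URL_TEMPLATE`, the `<id>.pdf` file naming, and the function names are all assumptions made for illustration.

```python
from pathlib import Path
from urllib.request import urlopen

# Hypothetical URL pattern; replace it with the archive's real one.
URL_TEMPLATE = "https://archive.example.org/docs/{doc_id}.pdf"


def fetch_bytes(url, timeout=30):
    """Return the raw bytes served at url."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_range(start, stop, out_dir, fetch=fetch_bytes):
    """Download documents start..stop (inclusive) into out_dir.

    Returns the list of paths that were saved. Documents that fail
    to download are reported and skipped.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for doc_id in range(start, stop + 1):
        url = URL_TEMPLATE.format(doc_id=doc_id)
        target = out_dir / f"{doc_id}.pdf"  # assumed naming scheme
        try:
            target.write_bytes(fetch(url))
            saved.append(target)
        except OSError as err:  # URLError and HTTPError are OSError subclasses
            print(f"skipped {url}: {err}")
    return saved

# Example call (performs real network requests):
# download_range(1, 10, "python/ex1/download")
```

The `fetch` parameter exists only so the loop can be tried without network access by passing a stand-in function.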
  • Try changing the range of the documents

    • For now, the document range is set from 1 to 10.
    • Change the range to download a different set of documents (keep it small for this exercise to save time).
    • Re-run the scripts.
    • Check the download folder to confirm that all the files were downloaded successfully.
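
Checking the download folder can also be done in code. The sketch below assumes, hypothetically, that each document is saved as `<id>.pdf`; adjust the file name pattern to whatever the notebook actually writes.

```python
from pathlib import Path


def missing_documents(start, stop, out_dir):
    """Return the IDs in start..stop (inclusive) that have no non-empty PDF in out_dir."""
    out_dir = Path(out_dir)
    missing = []
    for doc_id in range(start, stop + 1):
        pdf = out_dir / f"{doc_id}.pdf"  # assumed naming scheme
        if not pdf.is_file() or pdf.stat().st_size == 0:
            missing.append(doc_id)
    return missing

# Example call:
# missing_documents(1, 10, "python/ex1/download")
```

An empty list means every document in the range is present; any IDs returned are the ones to download again.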
  • Exporting the code to an HTML file with Markdown-styled text

    1. Position your cursor in a block.
    2. Insert a new block using the Insert menu.
    3. Change the new block's mode to Markdown.
    4. Type some Markdown-formatted text in the block.
    5. In the menu bar, click File -> Download as -> HTML
    6. Open the downloaded file in your browser. It is a plain HTML file generated automatically by Jupyter Notebook. In this way, you can turn a Python notebook into a styled document for the web.
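
The same export can be done without the menu through the `jupyter nbconvert` command. The sketch below wraps that command in Python; it assumes Jupyter is installed and on your PATH, and the notebook file name in the example call is hypothetical.

```python
import subprocess
from pathlib import Path


def nbconvert_command(notebook_path, output_dir):
    """Build the command line that converts a notebook to HTML."""
    return [
        "jupyter", "nbconvert",
        "--to", "html",
        "--output-dir", str(output_dir),
        str(notebook_path),
    ]


def export_html(notebook_path, output_dir="."):
    """Run nbconvert and return the expected path of the HTML file."""
    notebook_path = Path(notebook_path)
    subprocess.run(nbconvert_command(notebook_path, output_dir), check=True)
    return Path(output_dir) / (notebook_path.stem + ".html")

# Example call (hypothetical notebook name):
# export_html("python/ex1/notebook/ex1.ipynb", "python/ex1/notebook")
```

This is convenient when several notebooks need exporting, since the call can be placed in a loop.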