Fix Static Crawling Issue Due to Newly Implemented Anti-Scraping Mechanism #109

JunTingLin · 2024-02-18T17:59:45Z

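Hello,

First of all, thank you for developing and sharing such a useful project. While using it, I found that since the Lunar New Year, the original approach of using static crawling with requests to fetch all stock codes from http://isin.twse.com.tw/isin/C_public.jsp?strMode=2 no longer works. I suspect the site has strengthened its anti-scraping measures.

To work around this, I modified the fetch_data function in fetch.py to use Selenium for dynamic crawling. Since some users may run this project in an environment without a GUI, I enabled headless mode. However, once headless mode was enabled, connections failed frequently. After some experimenting, I found a workable solution: visit the main page https://isin.twse.com.tw first, pause for a few seconds, and then visit the target URL. With that, the required data is retrieved successfully.

If there are any problems with my changes, or if there is a better solution, please feel free to contact me.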
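
For readers skimming the thread, the warm-up sequence described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `make_headless_driver` and `fetch_page_source` are hypothetical names, the 5-second default simply mirrors the sleeps in the diff, and the sketch assumes Selenium can locate a Chrome driver by itself (the PR itself adds webdriver_manager for that).

```python
import time

MAIN_PAGE_URL = "https://isin.twse.com.tw"


def make_headless_driver():
    """Create a headless Chrome driver (hypothetical helper).

    Selenium is imported lazily so the module can be loaded without it.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run without a GUI
    return webdriver.Chrome(options=options)


def fetch_page_source(driver, url, warmup_seconds=5.0):
    """Visit the main page first, pause, then load the target URL.

    `driver` is any object exposing `get(url)` and `page_source`,
    such as a Selenium WebDriver.
    """
    driver.get(MAIN_PAGE_URL)    # warm-up visit to the site root
    time.sleep(warmup_seconds)   # fixed pause, as in the PR diff
    driver.get(url)              # now request the real target
    time.sleep(warmup_seconds)   # give the page time to render
    return driver.page_source
```

Under these assumptions, a call might look like `fetch_page_source(make_headless_driver(), "http://isin.twse.com.tw/isin/C_public.jsp?strMode=2")`, with the caller responsible for calling `driver.quit()` afterwards.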


JeffBla · 2024-04-11T04:23:18Z

Hello JunTingLin! I think I ran into the same problem as you: the update function fails.
I analyzed it and made some adjustments in #110.
I'm wondering whether it would be better not to use Selenium?

mitchhuang777

Consider adding try-except blocks to handle potential exceptions.
Use WebDriverWait(driver, 10).until rather than time.sleep.
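
A minimal sketch of what this suggestion could look like, assuming the target page is considered ready once a `<table>` element is present (that readiness condition is an assumption, not something stated in this PR). `load_and_wait` is a hypothetical helper name, and returning `None` on failure is just one possible error-handling choice.

```python
def load_and_wait(driver, url, timeout=10):
    """Load `url` and wait for a <table> instead of sleeping a fixed time.

    Returns the page source, or None if the page did not become ready
    within `timeout` seconds or the driver raised an error.
    """
    # Lazy imports so this module can be loaded without Selenium installed.
    from selenium.common.exceptions import TimeoutException, WebDriverException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    try:
        driver.get(url)
        # Poll until the element appears; returns as soon as it does.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
    except TimeoutException:
        # The table never showed up within the timeout.
        return None
    except WebDriverException:
        # Connection failures and other driver-level errors.
        return None
    return driver.page_source
```

Compared with a fixed `time.sleep(5)`, the explicit wait returns as soon as the condition holds and gives a clear failure path when it never does.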

mitchhuang777 · 2024-05-07T20:18:11Z

twstock/codes/fetch.py

+    driver.get(main_page_url)
+    time.sleep(5)  # wait for JavaScript rendering to finish
+    driver.get(url)
+    time.sleep(5)  # wait for JavaScript rendering to finish


A magic number is not a good approach here :(

mitchhuang777 · 2024-05-07T20:18:12Z

twstock/codes/fetch.py

+    # Use WebDriver to visit the main page first, then the specified URL
+    main_page_url = "https://isin.twse.com.tw"
+    driver.get(main_page_url)
+    time.sleep(5)  # wait for JavaScript rendering to finish


A magic number is not a good approach here :(

JunTingLin added 3 commits February 19, 2024 01:28

Fix static crawl timeout issue by adopting Selenium for dynamic fetch…

289a753

…ing of TWSE_EQUITIES and TPEX_EQUITIES

Enable headless mode for Selenium

7437b82

Add selenium and webdriver_manager as dependencies

ae3c7c7

mitchhuang777 reviewed May 7, 2024


jaki2011 Nov 9, 2024
