Crystal's branch #118

Open: wants to merge 33 commits into base: master
Changes from 26 commits (33 commits total)
1f71848
testing git
crystalpop Feb 9, 2025
f0ab030
Get url and defrag, check for domain in isValid
crystalpop Feb 9, 2025
84120a1
process_info in progress
crystalpop Feb 10, 2025
c04bbeb
add condition in extract_next_link
crystalpop Feb 11, 2025
a28912c
make relative to absolute paths
crystalpop Feb 11, 2025
1d0aedb
make sure there's content, avoid the robots.txt disallowed paths
crystalpop Feb 11, 2025
aa0ffaf
fixed typo
crystalpop Feb 11, 2025
17496cf
add calendar check
crystalpop Feb 11, 2025
394822d
Added more conditions but still getting caught in trap (e.g. Nand/sem…
crystalpop Feb 11, 2025
8765899
Still adding trap checks, this code has error
crystalpop Feb 11, 2025
798e8b8
Tried fixing repeated_segments and added Rishika's processing functio…
crystalpop Feb 11, 2025
c098d76
Added robot.txt disallow filter for is_valid, no sitemap yet, not tes…
crystalpop Feb 11, 2025
be96c39
Robot parser in progress
crystalpop Feb 12, 2025
1afa0eb
Added content length check for large files.
crystalpop Feb 12, 2025
a2fcaa2
Added page length loggin, not tested yet
crystalpop Feb 12, 2025
84a9a14
Added a couple extension checks
crystalpop Feb 12, 2025
3f8c52a
I think robot.txt checking is working.
crystalpop Feb 13, 2025
2bcce4e
Altered the robot parsing part, doens't need soup anymore. Added but …
crystalpop Feb 13, 2025
df48b5d
Uncomment robot parsing loop
crystalpop Feb 13, 2025
576b9dd
Changed robots & sitemap again, added simhash. Not tested bc server d…
crystalpop Feb 13, 2025
f82d32b
Fixed typos
crystalpop Feb 13, 2025
eba82dd
Tokenize now used correctly in compute_simhash()
crystalpop Feb 13, 2025
2f96ba6
Sitemap uncommented, if you want to handle sitemap restart.
crystalpop Feb 14, 2025
5eee41d
Added ppsx condition to is_valid
crystalpop Feb 14, 2025
3693636
Only count words over 3 chars and non-numeric
crystalpop Feb 15, 2025
eb9c0da
Most recent run
crystalpop Feb 18, 2025
8a68fe0
Added more comments, cleaned up a bit
crystalpop Feb 18, 2025
3af3875
Merge branch 'master' into crystal's-branch
crystalpop Feb 18, 2025
a28c67d
Merge pull request #1 from crystalpop/crystal's-branch
crystalpop Feb 18, 2025
a97bf31
Delete report.txt
crystalpop Feb 18, 2025
4d159b3
Update extension checks
crystalpop Feb 18, 2025
96f4aff
Update scraper.py: change threshold back to 3 MB
crystalpop Feb 18, 2025
71c69e8
Merge pull request #2 from crystalpop/master
crystalpop Feb 18, 2025
2 changes: 1 addition & 1 deletion config.ini
@@ -1,6 +1,6 @@
 [IDENTIFICATION]
 # Set your user agent string here.
-USERAGENT = DEFAULT AGENT
+USERAGENT = IR UW25 93481481,70321210,65332249,74612160

 [CONNECTION]
 HOST = styx.ics.uci.edu
1 change: 1 addition & 0 deletions content.txt
@@ -0,0 +1 @@
ICS Calendar – UC Irvine Donald Bren School of Information & Computer Sciences Skip to main content Search Clear Submit Admissions & AidBecome an Anteater Your future starts here! One of the leading schools of computing in the nation, ICS offers a broad range of undergraduate, graduate research, and graduate professional programs in Computer Science, Informatics, and Statistics with an emphasis on foundations, discovery, and experiential learning. Apply Now Welcome to ICS Mission & History Facts & Figures Admissions Undergraduate Graduate Paying for School Undergraduate Graduate Programs & AdvisingThrive as a Student Student success starts here! Undergraduate and graduate students enjoy limitless academic and extracurricular opportunities as part of the ICS community. Build Your Student Experience Undergraduate Programs Majors & Minors ICS Honors Program Undergraduate Academic Advising Graduate Programs Research Professional Graduate Academic Advising Student Experience Outreach, Access & Inclusion Career Development Clubs & Organizations Entrepreneurship Undergraduate Research ICS Tutoring Hub Campus Resources Research & DepartmentsLearn & Discover Pushing the boundaries of computing. Driven by curiosity and committed to positive change, our diverse community of faculty and students are pioneering computing technologies that are transforming our world. Explore Our Research Research at ICS Research Areas Departments Computer Science Informatics Statistics People Institutes & Centers Connected Learning Lab Cybersecurity Policy & Research Data Science Future Health Genomics & Bioinformatics HPI Machine Learning & Data Science Machine Learning & Intelligent Systems Responsible, Ethical & Accessible Tech Software Research Impact Faculty Awards & Honors Student Awards & Honors Placements in Academia Technologies & Startups News & EventsGet Involved Innovate. Collaborate. Stimulate. Get involved with the vibrant ICS community. 
Check out our news and participate in our events. See What's Happening Recent News Faculty Spotlights Student Spotlights Research Spotlights Alumni Spotlights Upcoming Events ICS Calendar Seminar Series ICS Distinguished Lecturer Computer Science Informatics Statistics Connected Learning Lab Cybersecurity Policy & Research Data Science Future Health Genomics & Bioinformatics HPI Machine Learning & Data Science Machine Learning & Intelligent Systems Responsible, Ethical & Accessible Tech Software Research Reports & Publications Alumni & PartnersMake an Impact Connecting with industry, engaging the community. From sponsoring capstone projects and becoming a corporate partner to supporting student scholarships and recruiting ICS students and alumni, your partnering opportunities are endless in ICS. Get Involved Alumni Events Hall of Fame Corporate & Community Engagement Capstone Projects Research Partnerships Student Recruitment Corporate Partners Industry Advisory Board Leadership Council Make a Gift Contact UsFollow UsSupport Us Home Events ICS Calendar Loading view. Views Navigation Hide filters Event Views Navigation List List Month Day Week Today Now Now - 10/31/2024 October 31 Select date. Condense Events Series Filters Changing any of the form inputs will cause the list of events to refresh with the filtered results. 
Done Clear Programs & Advising: Open filter Close filter Programs & Advising Career Development Clubs and Organizations Entrepreneurship Graduate Advising Graduate Programs Outreach, Access, and Inclusion Undergraduate Advising Undergraduate Programs Undergraduate Research Research Areas: Open filter Close filter Research Areas Accessible Computing AI, ML, and Natural Language Processing Algorithms and Theory All Research Areas Bayesian Statistics Biomedical Informatics and Computational Biology Biostatistics Compilers and Programming Languages Computer-Supported Cooperative Work Computer Architecture and Embedded Systems Computer Games and Virtual Worlds Computer Graphics and Vision CS Education Database and Information Systems Digital Media and Learning Distributed, Network, and Operating Systems Genomics Health Informatics Human-Computer Interaction IT and Organizations Security, Privacy, and Cryptography Software Engineering and Systems Statistics and Statistical Theory STS and Critical Information Studies Sustainability and Computing Departments: Open filter Close filter Departments Computer Science Informatics Statistics Institutes & Centers: Open filter Close filter Institutes & Centers Connected Learning Cybersecurity Policy and Research Data Science Future Health Genomics and Bioinformatics HPI Machine Learning and Data Science Machine Learning and Intelligent Systems Responsible, Ethical, and Accessible Tech Software Research Seminars: Open filter Close filter Seminars ACO Computer Science ICS Distinguished Lecturer Informatics Machine Learning and Intelligent Systems Statistics Alumni & Partners: Open filter Close filter Alumni & Partners Alumni Corporate and Community Engagement Venues: Open filter Close filter Venues Calit2 Donald Bren Hall Event Canceled ISEB Student Center UCI Anthill Pub Zoom October 2024 Master of Human-Computer Interaction & Design 10/15 October 15, 5:00 PM PT Zoom Explore the program’s innovative curriculum, discover key 
highlights, and learn about the admissions process. This is your chance to get all your questions answered and find out how UC Irvine can help you achieve your career goals. We can’t wait to connect with you online! Statistics Seminar Series Decoding Game: On Minimax Optimality of Heuristic Text Generation Strategies Jason Klusowski Assistant Professor, Department of Operations Research & Financial Engineering, Princeton University October 17, 4:00 PM PT 6011, Donald Bren Hall - View Map Abstract: Decoding strategies play a pivotal role in text generation for modern language models, yet a perplexing gap persists between theory and practice. Surprisingly, strategies… CS Seminar Series Vector Search and Databases Dr. Yannis Papakonstantinou Distinguished Engineer, Query Processing and GenAI at Google Cloud Databases October 18, 11:00 AM PT 6011, Donald Bren Hall - View Map Abstract: Semantic search ability, via embedding (vectors) and vector indexing, has been added to Google Cloud Platform (GCP) databases in order to enable GenAI applications.… Informatics Seminar Series (Virtual) Culturally-conscious Workforce Development: Community-based Approaches to Supporting Microbusinesses Julie Hui Assistant Professor, University of Michigan School of Information October 18, 2:00 PM PT Zoom Abstract: Building digital capacity among microbusinesses will require more than just providing broadband access; it will also involve more “culturally-conscious” approaches that leverage community assets… Master of Computer Science – Information Session October 21, 9:00 AM PT Zoom Explore the program’s innovative curriculum, discover key highlights, and learn about the admissions process. This is your chance to get all your questions answered and find out how UC Irvine can help you achieve your career goals. We can’t wait to connect with you online! 
Master of Software Engineering – Information Session October 22, 12:00 PM PT Zoom Don’t miss this opportunity to elevate your career and become a skilled professional in software engineering. Register now for the Online Webinar and Q&A Session and step into a future of endless possibilities! Statistics Seminar Series Causal Inference and Machine Learning in Mobile Health Tianchen Qian Assistant Professor, Department of Statistics, UC Irvine October 24, 4:00 PM PT 6011, Donald Bren Hall - View Map Abstract: Mobile health (mHealth) interventions, such as text messages and push notifications targeting behavior change, are a promising alternative to in-person healthcare. Understanding how the… MedTech Innovation Hackathon October 25, 9:00 AM - October 26, 5:00 PM PT The Department of Informatics and the Department of Anesthesiology & Perioperative Care are thrilled to present a unique hackathon dedicated to addressing real-world medical challenges. Register now! ACO Seminar Series Marketplace Design Challenges of Online Display Advertising Paul R. Milgrom Shirley and Leonard Ely Professor of Humanities and Sciences in the Department of Economics at Stanford University October 25, 11:00 AM PT Calit2 Abstract: In September 2024, a trial began in federal court in which the Department of Justice (DOJ) attacked Google's conduct in relation to its technology… Statistics Seminar Series Approximate Data Deletion and Replication with the Bayesian Influence Function Ryan Giordano Assistant Professor, Statistics, UC Berkeley October 29, 4:00 PM PT Abstract: Many model-agnostic statistical diagnostics are based on repeatedly re-fitting a model with some observations deleted or replicated. Cross-validation, the non-parametric bootstrap, and outlier detection… Master of Computer Science – Information Session October 31, 12:00 PM PT Zoom Explore the program’s innovative curriculum, discover key highlights, and learn about the admissions process. 
This is your chance to get all your questions answered and find out how UC Irvine can help you achieve your career goals. We can’t wait to connect with you online! Statistics Seminar Series Statistical Learning with Dependent Data Sumanta Basu Associate Professor of Statistics and Data Science, Cornell University October 31, 4:00 PM PT 6011, Donald Bren Hall - View Map Abstract: With advances in data collection and storage, statistical learning algorithms are becoming increasingly popular for structure learning and prediction with large-scale data sets that… Previous Events Today Next Events 6210 Donald Bren Hall Irvine, CA 92697-3425 (949) 824-7427 Like us on Facebook Follow us on Twitter Follow us on YouTube Add us on LinkedIn Follow us on Instagram Footer Navigation DirectoryFaculty & Staff ResourcesFaculty & Staff PositionsEmergency PreparednessAccessibilityPrivacy PolicyUCI HomeUCI DirectoryCampus Maps © 2024 All Rights Reserved. UCI Donald Bren School of Information & Computer Sciences Skip to content Open toolbar Accessibility Tools Accessibility Tools Increase TextIncrease Text Decrease TextDecrease Text GrayscaleGrayscale High ContrastHigh Contrast Negative ContrastNegative Contrast Light BackgroundLight Background Links UnderlineLinks Underline Readable FontReadable Font Reset Reset
2 changes: 1 addition & 1 deletion crawler/frontier.py
@@ -40,7 +40,7 @@ def _parse_save_file(self):
         total_count = len(self.save)
         tbd_count = 0
         for url, completed in self.save.values():
-            if not completed and is_valid(url):
+            if not completed and is_valid(url):  # took out is_valid() here
                 self.to_be_downloaded.append(url)
                 tbd_count += 1
         self.logger.info(
87 changes: 77 additions & 10 deletions crawler/worker.py
@@ -6,6 +6,10 @@
 import scraper
 import time

+from urllib.robotparser import RobotFileParser
+import re
+from urllib.parse import urlparse
+

 class Worker(Thread):
     def __init__(self, worker_id, config, frontier):
@@ -16,19 +20,82 @@ def __init__(self, worker_id, config, frontier):
         assert {getsource(scraper).find(req) for req in {"from requests import", "import requests"}} == {-1}, "Do not use requests in scraper.py"
         assert {getsource(scraper).find(req) for req in {"from urllib.request import", "import urllib.request"}} == {-1}, "Do not use urllib.request in scraper.py"
         super().__init__(daemon=True)
-
+
+    # For parsing robots.txt files after downloading
+    def parse_robot_file(self, resp, robot_parser):
+        if resp.status == 200:
+            if resp.raw_response:
+                # split the content into lines for the RobotFileParser to parse
+                lines = resp.raw_response.content.decode("utf-8").splitlines()
+                robot_parser.parse(lines)
+
+    # Checks each domain, returns True if the robot_parser.can_fetch, False if not.
+    def robot_allowed(self, robot_parsers, url):
+        parsed = urlparse(url)
+        domain = parsed.netloc.lower()
+        if re.match(scraper.ALLOWED_DOMAINS[0], domain):
+            if not robot_parsers[0].can_fetch(self.config.user_agent, url):
+                # print(f"{url} *****DISALLOWED IN ROBOTS*****")
+                return False
+        elif re.match(scraper.ALLOWED_DOMAINS[1], domain):
+            if not robot_parsers[1].can_fetch(self.config.user_agent, url):
+                # print(f"{url} *****DISALLOWED IN ROBOTS*****")
+                return False
+        elif re.match(scraper.ALLOWED_DOMAINS[2], domain):
+            if not robot_parsers[2].can_fetch(self.config.user_agent, url):
+                # print(f"{url} *****DISALLOWED IN ROBOTS*****")
+                return False
+        elif re.match(scraper.ALLOWED_DOMAINS[3], domain):
+            if not robot_parsers[3].can_fetch(self.config.user_agent, url):
+                # print(f"{url} *****DISALLOWED IN ROBOTS*****")
+                return False
+        return True
+
     def run(self):
+        # Download the robots.txt files for each domain
+        ics_resp = download("https://www.ics.uci.edu/robots.txt", self.config)
+        cs_resp = download("https://www.cs.uci.edu/robots.txt", self.config)
+        inf_resp = download("https://www.informatics.uci.edu/robots.txt", self.config)
+        stat_resp = download("https://www.stat.uci.edu/robots.txt", self.config)
+        robot_responses = [ics_resp, cs_resp, inf_resp, stat_resp]
+
+        # Create a robot file parser for each robots.txt file
+        ICS_RP = RobotFileParser()
+        CS_RP = RobotFileParser()
+        INF_RP = RobotFileParser()
+        STAT_RP = RobotFileParser()
+        robot_parsers = [ICS_RP, CS_RP, INF_RP, STAT_RP]
+
+        # Parse each robots.txt file
+        for i in range(0,4):
+            self.parse_robot_file(robot_responses[i], robot_parsers[i])
+
+            """Attempt at checking sitemaps:
+            If there are sitemaps for this domain, remove the original domain from
+            frontier and replace with the sitemap."""
+            # sitemaps = robot_parsers[i].site_maps()
+            # if sitemaps:
+            #     print(f"REMOVING {robot_responses[i].url[:-11]}")
+            #     self.frontier.to_be_downloaded.remove(robot_responses[i].url[:-11])
+            #     for map in sitemaps:
+            #         print(f"REPLACING WITH {map}")
+            #         self.frontier.add_url(map)
+
+
         while True:
             tbd_url = self.frontier.get_tbd_url()
             if not tbd_url:
                 self.logger.info("Frontier is empty. Stopping Crawler.")
                 break
-            resp = download(tbd_url, self.config, self.logger)
-            self.logger.info(
-                f"Downloaded {tbd_url}, status <{resp.status}>, "
-                f"using cache {self.config.cache_server}.")
-            scraped_urls = scraper.scraper(tbd_url, resp)
-            for scraped_url in scraped_urls:
-                self.frontier.add_url(scraped_url)
-            self.frontier.mark_url_complete(tbd_url)
-            time.sleep(self.config.time_delay)
+            # Only download url if allowed in robots.txt
+            if self.robot_allowed(robot_parsers, tbd_url):
+                resp = download(tbd_url, self.config, self.logger)
+                self.logger.info(
+                    f"Downloaded {tbd_url}, status <{resp.status}>, "
+                    f"using cache {self.config.cache_server}.")
+
+                scraped_urls = scraper.scraper(tbd_url, resp)
+                for scraped_url in scraped_urls:
+                    self.frontier.add_url(scraped_url)
+                self.frontier.mark_url_complete(tbd_url)
+                time.sleep(self.config.time_delay)
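A possible simplification of the `robot_allowed` chain in crawler/worker.py, offered as a minimal sketch rather than as the PR's implementation: keep the parsers in a dict keyed by host and match on the parsed hostname instead of indexing `scraper.ALLOWED_DOMAINS` with regexes. The name `build_parsers` and the dict layout are hypothetical. `RobotFileParser.parse` and `can_fetch` are the same standard-library calls the diff already uses. One assumption differs from the diff: a host with no parsed robots.txt is treated as allowed here.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def build_parsers(robots_text_by_host):
    """Build one RobotFileParser per host from already-downloaded robots.txt text.

    `robots_text_by_host` is a hypothetical dict such as
    {"ics.uci.edu": "<robots.txt body>"}; it is not part of the PR.
    """
    parsers = {}
    for host, text in robots_text_by_host.items():
        parser = RobotFileParser()
        parser.parse(text.splitlines())
        parsers[host] = parser
    return parsers


def robot_allowed(parsers, user_agent, url):
    """Return the robots.txt verdict of the first parser whose host covers the URL."""
    hostname = (urlparse(url).hostname or "").lower()
    for host, parser in parsers.items():
        # Exact host or a true subdomain; the leading dot keeps
        # informatics.uci.edu from matching ics.uci.edu.
        if hostname == host or hostname.endswith("." + host):
            return parser.can_fetch(user_agent, url)
    # Assumption: no rules known for this host, so do not block it.
    return True


if __name__ == "__main__":
    demo = build_parsers({"ics.uci.edu": "User-agent: *\nDisallow: /private/\n"})
    print(robot_allowed(demo, "TestAgent", "https://www.ics.uci.edu/private/x"))
    print(robot_allowed(demo, "TestAgent", "https://www.ics.uci.edu/public"))
```

Two things may be worth checking against the crawl logs. First, in CPython `can_fetch` returns False for a parser that was never fed through `parse()` or `read()`, so a robots.txt download that does not come back with status 200 would appear to block that whole domain in the diff's version. Second, in the diff a disallowed URL is skipped without `mark_url_complete` being called, so it stays marked incomplete in the save file.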
189 changes: 189 additions & 0 deletions finaloutput.txt
@@ -0,0 +1,189 @@
Answer 1:
Number of unique pages: 15565

Answer 2:
Longest page: https://grape.ics.uci.edu/wiki/public/raw-attachment/wiki/cs221-2019-spring-project2/Team2StressTest.txt with 97610 words

Answer 3:
List of 50 most common words:
gitlab: 52397
can: 32479
research: 29950
markellekelly: 26855
project: 24910
data: 23493
ics: 20942
software: 20803
information: 20794
uci: 20037
use: 19856
student: 19470
will: 19197
example: 18167
projects: 17626
may: 16922
computer: 16521
update: 16437
group: 15896
html: 15737
new: 15435
time: 14749
support: 13772
students: 13650
search: 13648
file: 13579
name: 13527
engineering: 13380
one: 13374
using: 13213
user: 13133
events: 12618
informatics: 12403
code: 12355
set: 12336
com: 12329
graduate: 11963
automatic: 11717
undergraduate: 11671
design: 10920
also: 10902
september: 10880
university: 10857
june: 10841
select: 10824
ramesh: 10698
news: 10571
science: 10502
july: 10483
application: 10448

Answer 4:
Number of subdomains found within ics.uci.edu: 127
Subdomains with count of unique pages within each:
accessibility.ics.uci.edu: 5
acoi.ics.uci.edu: 105
aiclub.ics.uci.edu: 2
archive.ics.uci.edu: 195
asterix.ics.uci.edu: 7
betapro.proteomics.ics.uci.edu: 3
cbcl.ics.uci.edu: 81
cdb.ics.uci.edu: 43
cert.ics.uci.edu: 17
checkin.ics.uci.edu: 5
chemdb.ics.uci.edu: 1
chenli.ics.uci.edu: 10
circadiomics.ics.uci.edu: 6
cloudberry.ics.uci.edu: 45
cml.ics.uci.edu: 172
code.ics.uci.edu: 14
computableplant.ics.uci.edu: 104
courselisting.ics.uci.edu: 4
cradl.ics.uci.edu: 17
create.ics.uci.edu: 6
cs.ics.uci.edu: 12
cs260p-hub.ics.uci.edu: 2
cs260p-staging-hub.ics.uci.edu: 1
cwicsocal18.ics.uci.edu: 12
cyberclub.ics.uci.edu: 50
cybert.ics.uci.edu: 27
dgillen.ics.uci.edu: 30
ds4all.ics.uci.edu: 3
duttgroup.ics.uci.edu: 114
dynamo.ics.uci.edu: 1
eli.ics.uci.edu: 4
elms.ics.uci.edu: 11
emj.ics.uci.edu: 42
esl.ics.uci.edu: 4
evoke.ics.uci.edu: 3
flamingo.ics.uci.edu: 21
fr.ics.uci.edu: 3
frost.ics.uci.edu: 1
futurehealth.ics.uci.edu: 148
gitlab.ics.uci.edu: 2399
grape.ics.uci.edu: 398
graphics.ics.uci.edu: 2
graphmod.ics.uci.edu: 1
hack.ics.uci.edu: 2
hai.ics.uci.edu: 5
helpdesk.ics.uci.edu: 4
hobbes.ics.uci.edu: 10
hpi.ics.uci.edu: 5
hub.ics.uci.edu: 4
i-sensorium.ics.uci.edu: 5
icde2023.ics.uci.edu: 46
ics45c-hub.ics.uci.edu: 1
ics45c-staging-hub.ics.uci.edu: 1
ics46-hub.ics.uci.edu: 1
ics46-staging-hub.ics.uci.edu: 2
ics53-hub.ics.uci.edu: 1
ics53-staging-hub.ics.uci.edu: 2
ieee.ics.uci.edu: 5
industryshowcase.ics.uci.edu: 21
informatics.ics.uci.edu: 2
insite.ics.uci.edu: 7
instdav.ics.uci.edu: 1
intranet.ics.uci.edu: 8
ipubmed.ics.uci.edu: 1
isg.ics.uci.edu: 262
jgarcia.ics.uci.edu: 31
julia-hub.ics.uci.edu: 1
luci.ics.uci.edu: 3
mailman.ics.uci.edu: 9
malek.ics.uci.edu: 1
mcs.ics.uci.edu: 10
mdogucu.ics.uci.edu: 3
mds.ics.uci.edu: 27
mhcid.ics.uci.edu: 21
mlphysics.ics.uci.edu: 18
motifmap-rna.ics.uci.edu: 2
motifmap.ics.uci.edu: 2
mover.ics.uci.edu: 24
mswe.ics.uci.edu: 10
mupro.proteomics.ics.uci.edu: 3
nalini.ics.uci.edu: 7
ngs.ics.uci.edu: 2449
oai.ics.uci.edu: 5
observium.ics.uci.edu: 1
onboarding.ics.uci.edu: 1
pastebin.ics.uci.edu: 1
pepito.proteomics.ics.uci.edu: 5
pgadmin.ics.uci.edu: 1
phpmyadmin.ics.uci.edu: 50
psearch.ics.uci.edu: 1
radicle.ics.uci.edu: 1
reactions.ics.uci.edu: 7
redmiles.ics.uci.edu: 1
riscit.ics.uci.edu: 3
scale.ics.uci.edu: 6
scratch.proteomics.ics.uci.edu: 4
sdcl.ics.uci.edu: 199
seal.ics.uci.edu: 7
selectpro.proteomics.ics.uci.edu: 7
sherlock.ics.uci.edu: 6
sli.ics.uci.edu: 316
speedtest.ics.uci.edu: 1
staging-hub.ics.uci.edu: 1
stairs.ics.uci.edu: 3
statconsulting.ics.uci.edu: 1
statistics-stage.ics.uci.edu: 11
student-council.ics.uci.edu: 15
summeracademy.ics.uci.edu: 6
svn.ics.uci.edu: 1
swiki.ics.uci.edu: 221
tad.ics.uci.edu: 3
tastier.ics.uci.edu: 1
transformativeplay.ics.uci.edu: 46
tutoring.ics.uci.edu: 5
tutors.ics.uci.edu: 1
ugradforms.ics.uci.edu: 1
unite.ics.uci.edu: 10
vision.ics.uci.edu: 171
wearablegames.ics.uci.edu: 11
wics.ics.uci.edu: 1439
wiki.ics.uci.edu: 1367
www-db.ics.uci.edu: 25
www.cert.ics.uci.edu: 1
www.graphics.ics.uci.edu: 7
www.ics.uci.edu: 3159
www.informatics.ics.uci.edu: 1
xtune.ics.uci.edu: 5
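
The scraper code that produced the Answer 4 tally is not part of this view, so the following is only a sketch of how such a per-subdomain count could be computed, assuming the crawler keeps a collection of defragmented unique URLs. The function name `count_subdomains` and its `parent` parameter are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def count_subdomains(unique_urls, parent="ics.uci.edu"):
    """Count unique pages per subdomain of `parent`, sorted alphabetically by host."""
    counts = Counter()
    for url in set(unique_urls):  # de-duplicate in case the caller passes a list
        hostname = (urlparse(url).hostname or "").lower()
        # The leading dot excludes look-alikes such as informatics.uci.edu.
        if hostname.endswith("." + parent):
            counts[hostname] += 1
    return sorted(counts.items())


if __name__ == "__main__":
    sample = [
        "https://www.ics.uci.edu/a",
        "https://www.ics.uci.edu/b",
        "https://vision.ics.uci.edu/",
        "https://www.informatics.uci.edu/",
    ]
    for host, pages in count_subdomains(sample):
        print(f"{host}: {pages}")
```

Sorting the `(host, count)` pairs gives the same alphabetical layout as the list in finaloutput.txt.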