Royce branch #109

Open
wants to merge 93 commits into
base: master
from
351e76b
stuff
Eucalyptus5 Oct 24, 2024
19a0501
fixing repetitive web crawling
Eucalyptus5 Oct 25, 2024
cb3d73f
deleted print statement
Eucalyptus5 Oct 25, 2024
1507090
fixes
Eucalyptus5 Oct 26, 2024
558ec46
banned some domains
Eucalyptus5 Oct 26, 2024
1e05806
Royce ID
xKimChip Oct 28, 2024
7a31b19
Added n_grams and some basic multi threading safety for the deque of …
Oct 29, 2024
b4b7c8c
REGAE
xKimChip Oct 31, 2024
a1c9f98
added the trie and some extra stuff for safety purposes
Oct 31, 2024
d43afab
Alphabet
xKimChip Oct 31, 2024
d03dad6
added some basic renaming because of new type alias
Oct 31, 2024
6d7860c
follows last changes
Oct 31, 2024
a9f22c1
Fixed regex parentheses
xKimChip Oct 31, 2024
91ad04a
added the globals to their own file to make mutexing them way easier
Oct 31, 2024
4b0d185
Merge branch 'master' into andre_branch
26dre Oct 31, 2024
ceaa9ef
Merge pull request #1 from 26dre/andre_branch
26dre Oct 31, 2024
20ed00f
put the domain trie prints in the top level module so it does not run…
Oct 31, 2024
361592e
Merge branch 'royce_branch' of https://github.com/26dre/spacetime-cra…
xKimChip Oct 31, 2024
1cc774a
added a new global variable URL_THRESHOLD_SIMILARITY that checks agai…
Oct 31, 2024
a5538ca
Merge pull request #2 from 26dre/andre_branch
26dre Oct 31, 2024
9eac990
added a basic test to link similarity.py as well as a test suite func…
Nov 1, 2024
6300977
added a test suite class and test cases for link_similarity.py to con…
Nov 2, 2024
922f659
Merge pull request #3 from 26dre/andre_branch
26dre Nov 2, 2024
0bac6ec
separated tokenizer
dillonct Nov 2, 2024
710448f
fixed id
dillonct Nov 2, 2024
8c0a954
Merge pull request #4 from 26dre/dillon_branch
dillonct Nov 2, 2024
7ed6ad2
Merge remote-tracking branch 'origin' into royce_branch
xKimChip Nov 2, 2024
ed90d08
put the ngrams in their own module to allow for easier code readability
Nov 2, 2024
acab58e
Merge pull request #5 from 26dre/andre_branch
26dre Nov 2, 2024
e9708c9
moved around all n_grams things to its own ngrams.py file in order to…
Nov 2, 2024
29f0b78
Merge pull request #6 from 26dre/andre_branch
26dre Nov 2, 2024
780ccd4
Merge remote-tracking branch 'origin' into royce_branch
xKimChip Nov 2, 2024
fe7b5d7
Merge remote-tracking branch 'origin' into royce_branch
xKimChip Nov 2, 2024
b0222ae
Merge branch 'master' of https://github.com/26dre/spacetime-crawler4py
xKimChip Nov 2, 2024
5dc73d9
small basic changes to globals and link similarity for readability
Nov 2, 2024
87e9eb3
Dates again
xKimChip Nov 2, 2024
1abfdd9
local_regex.py
xKimChip Nov 2, 2024
b2ed09d
Merge branch 'royce_branch' of https://github.com/26dre/spacetime-cra…
xKimChip Nov 2, 2024
62fb83b
changed type alias and tested
dillonct Nov 2, 2024
1defa50
Merge pull request #7 from 26dre/dillon_branch
dillonct Nov 2, 2024
9a87240
added some function to read and write to global variables in a more e…
Nov 2, 2024
5745657
Merge pull request #8 from 26dre/andre_branch
26dre Nov 2, 2024
d6f8c26
andre id
xKimChip Nov 2, 2024
72d3918
Runs, but my laptop crashed
xKimChip Nov 2, 2024
9b6d956
added some basic thread safety, functions work in theory but have not…
Nov 3, 2024
8e5030e
Merge pull request #9 from 26dre/andre_branch
26dre Nov 3, 2024
5b59bb0
added a non destructive check on whether the url should be evaluated …
Nov 3, 2024
ac287fd
Merge pull request #10 from 26dre/andre_branch
26dre Nov 3, 2024
b848495
added stuff to domainTrie in order to make it more usable and allow t…
Nov 5, 2024
0e6c35a
Merge pull request #11 from 26dre/andre_branch
26dre Nov 5, 2024
28c0c89
made a couple more data structures thread_safe, moved some functions …
Nov 5, 2024
d5fb4f4
can save files now
dillonct Nov 5, 2024
2da4137
Merge pull request #12 from 26dre/andre_branch
26dre Nov 5, 2024
5ea8ad3
fixed to globals
dillonct Nov 5, 2024
eff0a93
Merge branch 'master' into dillon
dillonct Nov 5, 2024
46b03bf
Merge pull request #13 from 26dre/dillon
dillonct Nov 5, 2024
2fae8f3
added some basic things to ngrams and some small adjustments to scraper
Nov 5, 2024
6e3652c
Merge pull request #14 from 26dre/andre_branch
26dre Nov 5, 2024
44c357d
small update to globals
Nov 5, 2024
c87201a
Merge pull request #15 from 26dre/andre_branch
26dre Nov 5, 2024
009fc59
updated word frequency logging
dillonct Nov 5, 2024
0483a04
Merge pull request #16 from 26dre/dillon
dillonct Nov 5, 2024
aa2235e
update logging
dillonct Nov 5, 2024
ebbc977
Merge pull request #17 from 26dre/dillon
dillonct Nov 5, 2024
4a23d9a
andre stop writing python code like C and updated logging
dillonct Nov 5, 2024
5b66456
fixed ngrams, turns out i was doing some wonky stuff with additi…
Nov 9, 2024
d126ef7
Merge pull request #18 from 26dre/andre_branch
26dre Nov 9, 2024
196c447
added some extra safety checks and made everything accessible via mul…
Nov 11, 2024
68028b5
Merge pull request #19 from 26dre/andre_branch
26dre Nov 11, 2024
19463ab
added some basic changes to the scraper so that it works thru more regex
Nov 11, 2024
d8adb93
Merge pull request #20 from 26dre/andre_branch
26dre Nov 11, 2024
f29af3a
added one extra safety check inside of the worker
Nov 11, 2024
e8c51d1
Merge pull request #21 from 26dre/andre_branch
26dre Nov 11, 2024
e9c8229
changed worker so that it should work multithreaded with politeness b…
Nov 11, 2024
d892c7f
added up to 20 threads as it should be threadsafe now
Nov 11, 2024
18fc22b
Merge pull request #22 from 26dre/andre_branch
26dre Nov 11, 2024
3abcc12
combined a couple of if statements with same early return for the sak…
Nov 12, 2024
0a19d12
made some changes in globals.py so that it is made more readable now,…
Nov 12, 2024
c66e729
made the n grams code fully threadsafe (it wasn't before) just utiliz…
Nov 12, 2024
a1afef8
Merge pull request #23 from 26dre/andre_branch
26dre Nov 12, 2024
14fb443
M1 kinda complete
Eucalyptus5 Nov 13, 2024
7e01969
added query retrieval (m2), is now mostly complete (but not totally, …
Nov 16, 2024
4da5927
fixed the query input, now correctly takes in queries that include AN…
Nov 17, 2024
f2965c2
added small change that allows globals to compile on openlab
Nov 17, 2024
c6f7a45
added changes to the postings class that allow it to be hashed and us…
Nov 17, 2024
7c32cbb
added changes that allow OR and AND queries to work in a multithreade…
Nov 17, 2024
cd1806e
fully tested against the large .pkl document produced
Nov 17, 2024
103f13b
Merge pull request #24 from 26dre/andre_branch_index_cons
26dre Nov 17, 2024
b5b0ea0
added a testing suite and file for testing
Nov 25, 2024
5183f9f
Lemmatize
xKimChip Nov 26, 2024
38af34d
Index of Index
xKimChip Nov 26, 2024
0ce67bc
Starting Index of Index
xKimChip Nov 26, 2024
0b23abe
Added weights to term frequencies. I may change it to addition instea…
xKimChip Nov 30, 2024