ML experiment maintenance #485

Closed
wants to merge 23 commits into from
90 changes: 87 additions & 3 deletions .github/workflows/benchmark.yml
@@ -22,7 +22,8 @@ jobs:
- name: Checkout CredData
uses: actions/checkout@v3
with:
repository: Samsung/CredData
repository: babenek/CredData
ref: experiment

- name: Cache data
id: cache-data
@@ -62,7 +63,8 @@ jobs:
- name: Checkout CredData
uses: actions/checkout@v3
with:
repository: Samsung/CredData
repository: babenek/CredData
ref: experiment

- name: Cache data
id: cache-data
@@ -166,7 +168,8 @@ jobs:
- name: Checkout CredData
uses: actions/checkout@v3
with:
repository: Samsung/CredData
repository: babenek/CredData
ref: experiment

- name: Cache data
id: cache-data
@@ -330,4 +333,85 @@ jobs:

exit ${exit_code}

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

experiment:
# the ML train test is placed here so it can reuse the cached data set
needs: [ download_data ]

runs-on: ubuntu-latest

steps:

- name: Checkout CredData
uses: actions/checkout@v3
with:
repository: babenek/CredData
ref: experiment

- name: Cache data
id: cache-data
uses: actions/cache@v3
with:
path: data
key: cred-data-${{ hashFiles('snapshot.yaml') }}

- name: Fail when the cache is missed
if: steps.cache-data.outputs.cache-hit != 'true'
run: exit 1

- name: Exclude some sets and move the rest to the CredData dir
# keep only b* & c* so that experiment/src/split.json is easy to correct
if: steps.cache-data.outputs.cache-hit == 'true'
run: |
rm -rf data/[0-9ad-f]*
rm -rf meta/[0-9ad-f]*
mkdir -vp ${{ github.workspace }}/CredData
mv data ${{ github.workspace }}/CredData/
mv meta ${{ github.workspace }}/CredData/

- name: Set up Python 3.8
if: steps.cache-data.outputs.cache-hit == 'true'
uses: actions/setup-python@v3
with:
python-version: "3.8"

- name: Update PIP
run: python -m pip install --upgrade pip

- name: Checkout current CredSweeper
uses: actions/checkout@v3
with:
ref: ${{ github.event.pull_request.head.sha }}
path: CredSweeper.head

- name: Install development packages
run: python -m pip install --requirement CredSweeper.head/requirements.txt

- name: Install experimental packages
# some versions will be changed for compatibility
run: python -m pip install --requirement CredSweeper.head/experiment/requirements.txt

- name: dbg
run: echo ${{ github.workspace }} && ls -al ${{ github.workspace }} && tree ${{ github.workspace }}

- name: Lighten split.json
run: |
mv -vf ${{ github.workspace }}/CredSweeper.head/experiment/src/split.json ${{ github.workspace }}/CredSweeper.head/experiment/src/split.json.bak
cat ${{ github.workspace }}/CredSweeper.head/experiment/src/split.json.bak
grep -v '"[0-9ad-f][0-9a-f]\+' ${{ github.workspace }}/CredSweeper.head/experiment/src/split.json.bak >${{ github.workspace }}/CredSweeper.head/experiment/src/split.json
cat ${{ github.workspace }}/CredSweeper.head/experiment/src/split.json

- name: Do the experiment
run: |
cd CredSweeper.head
ls -al #dbg
pwd #dbg
export PYTHONPATH=$(pwd):${PYTHONPATH}
cd experiment
python -m credsweeper --banner #dbg
python main.py --data ${{ github.workspace }}/CredData -j $(( 2 * $(nproc) ))
ls -al results


# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
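Note on the "Lighten split.json" step: the `grep -v` filter drops every line that contains a double quote followed by a hex digit other than `b` or `c` and at least one more hex digit. Below is a minimal Python sketch of the same line filter, for reasoning about what survives. The sample lines and the `lighten` helper are hypothetical; the real layout of `experiment/src/split.json` is not shown in this diff.

```python
import re

# Same pattern as the workflow's grep; the BRE `\+` becomes `+` in Python.
DROP = re.compile(r'"[0-9ad-f][0-9a-f]+')


def lighten(lines):
    """Keep only the lines that the workflow's `grep -v` would keep."""
    return [line for line in lines if not DROP.search(line)]


# Hypothetical sample; the real split.json layout is an assumption.
sample = [
    '{',
    '  "train": [',
    '    "0a1b2c3d",',
    '    "b0c1d2e3",',
    '    "c9f8e7d6",',
    '    "f00dfeed"',
    '  ]',
    '}',
]

kept = lighten(sample)
for line in kept:
    print(line)
```

In this sample the last surviving id keeps its trailing comma, so the output is no longer strict JSON. Whether that matters for the real file depends on its layout and on how the experiment code reads it.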
12 changes: 6 additions & 6 deletions cicd/benchmark.txt
@@ -246,12 +246,12 @@ Detected Credentials: 6112
credsweeper result_cnt : 5230, lost_cnt : 0, true_cnt : 4349, false_cnt : 881
Category TP FP TN FN FPR FNR ACC PRC RCL F1
-------------------------- ---- ---- -------- ---- --------- ---------- -------- -------- -------- --------
Authentication Key & Token 76 5 28 15 0.151515 0.164835 0.83871 0.938272 0.835165 0.883721
Authentication Key & Token 76 5 28 17 0.151515 0.182796 0.825397 0.938272 0.817204 0.873563
Generic Secret 979 7 213 84 0.0318182 0.0790216 0.929072 0.992901 0.920978 0.955588
Generic Token 295 12 592 36 0.0198676 0.108761 0.948663 0.960912 0.891239 0.924765
Other 581 722 62596 248 0.0114028 0.299156 0.984878 0.445894 0.700844 0.545028
Password 1006 131 4078 390 0.0311238 0.27937 0.907047 0.884785 0.72063 0.794315
Predefined Pattern 356 2 12 17 0.142857 0.0455764 0.950904 0.994413 0.954424 0.974008
Generic Token 295 12 593 37 0.0198347 0.111446 0.947705 0.960912 0.888554 0.923318
Other 595 722 62939 250 0.0113413 0.295858 0.984932 0.451784 0.704142 0.550416
Password 1012 131 4127 404 0.0307656 0.285311 0.90571 0.885389 0.714689 0.790934
Predefined Pattern 356 2 12 19 0.142857 0.0506667 0.946015 0.994413 0.949333 0.971351
Private Key 1011 0 29 1 0.00098814 0.999039 1 0.999012 0.999506
Seed, Salt, Nonce 45 2 6 3 0.25 0.0625 0.910714 0.957447 0.9375 0.947368
4349 881 19354109 794 4.552e-05 0.154385 0.999913 0.831549 0.845615 0.838523
4369 881 19373332 815 4.547e-05 0.157215 0.999912 0.83219 0.842785 0.837454
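Note on the benchmark columns: the derived values appear to follow the standard confusion-matrix definitions. The sketch below is an assumption about how they are computed, not the benchmark script itself; it reproduces the updated "Authentication Key & Token" row (TP=76, FP=5, TN=28, FN=17) to six decimal places. The `metrics` helper is hypothetical.

```python
def metrics(tp, fp, tn, fn):
    """Derive the table's rate columns from raw confusion-matrix counts."""
    prc = tp / (tp + fp)  # precision
    rcl = tp / (tp + fn)  # recall
    return {
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "PRC": prc,
        "RCL": rcl,
        "F1": 2 * prc * rcl / (prc + rcl),
    }


# Counts taken from the updated "Authentication Key & Token" row.
row = {name: round(value, 6) for name, value in metrics(76, 5, 28, 17).items()}
print(row)
```

The helper has no guard against zero denominators, so it is only suitable for rows where every count sum is non-zero.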
6 changes: 3 additions & 3 deletions credsweeper/common/constants.py
@@ -5,13 +5,13 @@

class KeywordPattern:
"""Pattern set of keyword types"""
key_left = r"(?P<variable>(([`'\"]+[^:='\"`<>&]*|[^:='\"`<>\s\(&]*)" \
key_left = r"(?P<variable>(([`'\"]+[^:='\"`}<>&]*|[^:='\"`}<>\s\(&]*)" \
r"(?P<keyword>"
# a keyword will be inserted here
key_right = r")" \
r"[^:='\"`<>\?\!]*)[`'\"]*)" # <variable>
r"[^:='\"`<>{?!&]*)[`'\"]*)" # <variable>
separator = r"\s*\]?\s*" \
r"(?P<separator>:( [a-z]{3,9} )?=|:|=>|!=|==|=)" \
r"(?P<separator>:( [a-z]{3,9} )?=|:|=>|!=|===|==|=)" \
r"((?!\s*ENC(\(|\[))(\s|\w)*\((\s|\w|=|\()*|\s*)"
value = r"(?P<value_leftquote>((b|r|br|rb|u|f|rf|fr|\\)?[`'\"])+)?" \
r"(?P<value>(?:\{[^}]{3,8000}\})|(?:<[^>]{3,8000}>)|" \
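Note on the `separator` change: Python tries regex alternatives left to right, so `===` only matches as a whole when it is listed before `==`. The sketch below uses just the separator fragment from the old and new lines, not the full `KeywordPattern`, and the `split_on_separator` helper is hypothetical.

```python
import re

# Separator fragments copied from the old and new lines of this hunk.
OLD_SEP = re.compile(r"(?P<separator>:( [a-z]{3,9} )?=|:|=>|!=|==|=)")
NEW_SEP = re.compile(r"(?P<separator>:( [a-z]{3,9} )?=|:|=>|!=|===|==|=)")


def split_on_separator(pattern, line):
    """Return the matched separator and the text that follows it."""
    match = pattern.search(line)
    return match.group("separator"), line[match.end():]


line = 'password === "dummy"'
old = split_on_separator(OLD_SEP, line)
new = split_on_separator(NEW_SEP, line)
print(old)
print(new)
```

With the old fragment, the third `=` is left in front of the value; with the new one, the whole `===` is consumed as the separator.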
2 changes: 1 addition & 1 deletion credsweeper/ml_model/model_config.json
@@ -48,6 +48,6 @@
".asciidoc", ".yaml", ".sh", ".c", ".cs", ".php", ".txt", ".yml", ".java", ".ts", ".md", ".js", ".json",
".rb", ".py", ".go"
]}},
{"type": "RuleName", "kwargs": {"rule_names": ["Token", "Secret", "AWS Client ID", "API", "Credential", "Password", "Key", "Auth"]}}
{"type": "RuleName", "kwargs": {"rule_names": ["Token", "Secret", "Github Old Token", "API", "Credential", "Password", "Key", "Auth", "JSON Web Token", "URL Credentials", "Nonce", "Salt", "Certificate"]}}
]
}
26 changes: 13 additions & 13 deletions credsweeper/rules/config.yaml
@@ -100,7 +100,7 @@
doc_only: true

- name: API
severity: medium
severity: critical
type: keyword
values:
- api
@@ -169,7 +169,7 @@
min_line_len: 30

- name: Credential
severity: medium
severity: critical
type: keyword
values:
- credential
@@ -201,7 +201,7 @@
min_line_len: 31

- name: Github Old Token
severity: high
severity: critical
type: pattern
values:
- (?i)((git)[\w\-]*(token|key|api)[\w\-]*(\s)*(=|:|:=)(\s)*(["']?)(?P<value>[a-z|\d]{40})(["']?))
@@ -279,7 +279,7 @@
min_line_len: 105

- name: JSON Web Token
severity: medium
severity: critical
type: pattern
values:
- (^|[^.0-9A-Za-z_+-])(?P<value>eyJ[0-9A-Za-z_=-]{15,8000}([.0-9A-Za-z_=-]{1,8000})?)
@@ -313,7 +313,7 @@
min_line_len: 36

- name: Password
severity: medium
severity: critical
type: keyword
values:
- (?<!by)pass(?!ed|ing|es|\s+[a-z]{3,80})|pw(d|\b)
@@ -366,7 +366,7 @@
min_line_len: 40

- name: Secret
severity: medium
severity: critical
type: keyword
values:
- secret
@@ -476,7 +476,7 @@
min_line_len: 50

- name: Token
severity: medium
severity: critical
type: keyword
values:
- token
@@ -498,7 +498,7 @@
min_line_len: 34

- name: URL Credentials
severity: high
severity: critical
type: pattern
values:
- ://[^:\s]*(?P<separator>:)(?P<value>[^@\s]+)@
@@ -510,7 +510,7 @@
doc_available: false

- name: Auth
severity: medium
severity: critical
type: keyword
values:
- auth(?!(or|ors)(?!i[tz]))
@@ -522,7 +522,7 @@
doc_available: false

- name: Key
severity: medium
severity: critical
type: keyword
values:
- key(?!word)
@@ -604,7 +604,7 @@
min_line_len: 14

- name: Nonce
severity: medium
severity: critical
type: keyword
values:
- nonce
@@ -616,7 +616,7 @@
doc_available: false

- name: Salt
severity: medium
severity: critical
type: keyword
values:
- salt
@@ -628,7 +628,7 @@
doc_available: false

- name: Certificate
severity: medium
severity: critical
type: keyword
values:
- cert
Empty file added experiment/__init__.py
Empty file.