
Merge pull request #33 from AbsaOSS/feature/models-and-interfaces
Add models and interfaces
OlivieFranklova authored Nov 26, 2024
2 parents 3399ac2 + e09c45d commit 6fe5e2e
Showing 126 changed files with 16,095 additions and 5,627 deletions.
15 changes: 8 additions & 7 deletions .github/workflows/py_test.yml
@@ -21,6 +21,7 @@ jobs:

- name: Install dependencies
run: |
pip install -r requirements.txt
pip install pylint
- name: Analysing the code with pylint
@@ -50,7 +51,7 @@ jobs:
python-tests:
env:
TEST_FILES: test/test_types.py test/test_metadata.py test/test_comparator.py test/test_column2VecCache.py
TEST_FILES: tests/similarity_framework/test_similarity* tests/column2vec/test_column2vec_cache.py
name: Run Python Tests
runs-on: ubuntu-latest
steps:
@@ -66,26 +67,26 @@ jobs:
- name: Install dependencies
run: |
pip install -r requirements.txt
pip install coverage pytest
- name: Run tests
run: coverage run --source='similarity,column2Vec' -m pytest $TEST_FILES
run: |
coverage run -m pytest $TEST_FILES
- name: Show coverage
run: coverage report -m --omit=".*.ipynb"
run: coverage report -m --omit=".*.ipynb,similarity_runner/*"

- name: Create coverage file
if: github.event_name == 'pull_request'
run: coverage xml
run: coverage xml --omit=".*.ipynb,similarity_runner/*"

- name: Get Cover
if: github.event_name == 'pull_request'
uses: orgoro/coverage@v3.1
uses: orgoro/coverage@v3.2
with:
coverageFile: coverage.xml
token: ${{ secrets.GITHUB_TOKEN }}
thresholdAll: 0.7
thresholdNew: 0.9
thresholdNew: 0.7

- uses: actions/upload-artifact@v4
if: github.event_name == 'pull_request'
5 changes: 4 additions & 1 deletion .gitignore
@@ -122,7 +122,7 @@ celerybeat.pid
*.sage.py

# Environments
.env
.config
.venv
env/
venv/
@@ -165,3 +165,6 @@ cython_debug/
# Custom for this project
fingerprints/
**/.DS_Store

column2vec/research
measurement
2 changes: 1 addition & 1 deletion .pylintrc
@@ -68,7 +68,7 @@ ignored-modules=

# Python code to execute, usually for sys.path manipulation such as
# pygtk.require().
#init-hook=
init-hook='import sys; sys.path.append("./similarity"); sys.path.append("./similarityRunner")'

# Use multiple processes to speed up Pylint. Specifying 0 will auto-detect the
# number of processors available to use, and will cap the count on Windows to
26 changes: 18 additions & 8 deletions README.md
@@ -27,7 +27,7 @@ the main set (training) on which the program is
tuned, and a validation set for validating the results.

#### Definition of table similarity:
![img_1.png](images/similarity_def.png)
![img_1.png](docs/similarity_def.png)
>Parameter **important columns** is user input.
>
>Parameter **k** is also user input.
@@ -44,8 +44,8 @@ input for Comparator.
Comparator compares metadata and it computes distance.
We should test which one is better.

1. ![img_2.png](images/pipeline1.png)
2. ![img_3.png](images/pipeline2.png)
1. ![img_2.png](similarity_framework/docs/pipeline1.png)
2. ![img_3.png](similarity_framework/docs/pipeline2.png)
#### Metadata creator
MetadataCreator has:
- **constructor** that fills fields:
@@ -102,7 +102,7 @@ of these two tables.
### Column2Vec
Column2Vec is a module in which we implement word2Vec based functionality for columns.
It will compute embeddings for columns, so we can compare them.
More about this module can be found [here](column2Vec/README.md).
More about this module can be found [here](column2vec/README.md).
### Types and Kinds
We have decided to split columns by type. We can compute types or kinds for each column.
Types define the real type of column. Some you may know from programming languages (int, float, string)
@@ -118,22 +118,22 @@ Explaining some types:
- phrase: string with more than one word
- multiple: string that represents not atomic data or structured data
- article: string with more than one sentence
3. ![img.png](images/types.png)
3. ![img.png](docs/types.png)
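
The string type rules listed above (phrase: more than one word, article: more than one sentence) can be sketched as follows. This is only an illustration, not the project's implementation: the function name `guess_text_type`, the fallback label `"word"` and the naive sentence splitting are assumptions, and the **multiple** type is not covered.

```python
import re


def guess_text_type(value: str) -> str:
    """Hypothetical helper: label a string as article, phrase or word."""
    # Naive sentence split on '.', '!' and '?'; empty pieces are dropped.
    # Assumption: good enough for a sketch, but it would mislabel "3.14".
    sentences = [part for part in re.split(r"[.!?]+", value) if part.strip()]
    if len(sentences) > 1:
        return "article"  # string with more than one sentence
    if len(value.split()) > 1:
        return "phrase"  # string with more than one word
    return "word"  # assumed fallback label for a single token


print(guess_text_type("It rains. We stay home."))
print(guess_text_type("hello world"))
```

The sentence check runs first because every multi-sentence string also has more than one word.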
Kind has only four "types" plus undefined. You can see all kinds in picture 4.
Explaining kinds:
- A column is marked as **Id** if it contains only unique values
- A column is marked as **Bool** if it contains only two unique values
- A column is marked as **Constant** if it contains only one unique value
- A column is marked as **Categorical** if it contains categories: the number of unique values is less than threshold % of the total number of rows. The threshold is different for small and large datasets.
4. ![img.png](images/kind.png)
4. ![img.png](docs/kind.png)
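
A minimal sketch of the kind rules above, using plain Python only. The function name `guess_kind`, the single `categorical_threshold` parameter (given as a fraction, with an arbitrary default of 0.1) and the order in which the rules are checked are assumptions made for illustration; the README only says that the threshold depends on dataset size.

```python
from typing import Hashable, Sequence


def guess_kind(values: Sequence[Hashable], categorical_threshold: float = 0.1) -> str:
    """Hypothetical helper: return a kind label for one column."""
    total = len(values)
    if total == 0:
        return "undefined"
    unique = len(set(values))
    # Assumed priority: constant and bool are checked before id, so a
    # one-row column counts as constant and a two-row column as bool.
    if unique == 1:
        return "constant"
    if unique == 2:
        return "bool"
    if unique == total:
        return "id"
    # Categorical: the share of unique values is below the threshold.
    if unique / total < categorical_threshold:
        return "categorical"
    return "undefined"


print(guess_kind([1, 2, 3, 4]))
print(guess_kind(["a", "b", "c"] * 20))
```

A real implementation would choose the threshold from the number of rows instead of taking a fixed default.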
### Applicability
- merging teams
- company mergers
- finding out which data are duplicated
- finding similar or different data
## Structure
- **Source code** is in folder [similarity](similarity). More about similarity folder structure in [README.md](similarity/README.md)
- **Source code for column2Vec** is in folder [column2Vec](column2Vec).
- **Source code for column2Vec** is in folder [column2Vec](column2vec).
- **Tests** are in folder [test](test)
- **Data** are stored in folders [**data**](data) and [**data_validation**](data_validation).
- **Main folder** contains: folder .github, files .gitignore, CONTRIBUTING.MD, LICENSE, README.md, requirements.txt, constants.py and main.py
@@ -142,7 +142,7 @@ Explaining kinds:
---

**column2Vec** folder contains all files for [column2Vec](#column2Vec) feature.
More about the structure of this folder can be found [here](column2Vec/README.md/#structure).
More about the structure of this folder can be found [here](column2vec/README.md/#structure).

**Datasets** for testing are stored in [**data**](data) and [**data_validation**](data_validation)
Corresponding link, name and eventual description for each dataset is
@@ -213,5 +213,15 @@ black similarity/metadata_creator.py
```
You can change black settings in [pyproject.toml](pyproject.toml) file.


#### Coverage
You can run it by using this command:
```bash
PYTHONPATH="./similarity:./similarityRunner:$PYTHONPATH" \
coverage run --source='similarity,column2Vec,similarityRunner' -m \
pytest test/test_similarity* test/test_runner* test/test_column2VecCache.py

```

## How to contribute
Please see our [**Contribution Guidelines**](CONTRIBUTING.md).
14 changes: 0 additions & 14 deletions column2Vec/generated/Average column2vec vectors.html

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Average_column2vec_vectors.html

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Clean_uniq_sentence_column2vec_vectors.html

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Cleaned_sentence_column2vec.html

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Cleaned_sentence_column2vec_vectors.html

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Clusters_Average_column2vec_clusters.html

This file was deleted.

This file was deleted.

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Clusters_Sentence_column2vec_clusters.html

This file was deleted.

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/My_Clusters.html

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Sentence column2vec vectors.html

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Sentence_column2vec_vectors.html

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Weighted_average_column2vec_vectors.html

This file was deleted.

2 changes: 0 additions & 2 deletions column2Vec/generated/cache.txt

This file was deleted.
