diff --git a/AUTHORS.md b/AUTHORS.md
index 54664fe0c..1816f73e2 100644
--- a/AUTHORS.md
+++ b/AUTHORS.md
@@ -74,6 +74,8 @@ To contributors: please add your name to the list when you submit a patch to the
* SAR PySpark improvement
* **[Daniel Schneider](https://github.com/danielsc)**
* FastAI notebook
+* **[David Davó](https://github.com/daviddavo)**
+ * Added R-Precision metric
* **[Evgenia Chroni](https://github.com/EvgeniaChroni)**
* Multinomial VAE algorithm
* Standard VAE algorithm
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 217c1c900..4d25db4e3 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -7,14 +7,18 @@ Licensed under the MIT License.
Contributions are welcomed! Here's a few things to know:
-- [Contribution Guidelines](#contribution-guidelines)
- - [Steps to Contributing](#steps-to-contributing)
- - [Coding Guidelines](#coding-guidelines)
- - [Microsoft Contributor License Agreement](#microsoft-contributor-license-agreement)
- - [Code of Conduct](#code-of-conduct)
- - [Do not point fingers](#do-not-point-fingers)
- - [Provide code feedback based on evidence](#provide-code-feedback-based-on-evidence)
- - [Ask questions do not give answers](#ask-questions-do-not-give-answers)
+- [Steps to Contributing](#steps-to-contributing)
+- [Ideas for Contributions](#ideas-for-contributions)
+ - [A first contribution](#a-first-contribution)
+ - [Datasets](#datasets)
+ - [Models](#models)
+ - [Metrics](#metrics)
+ - [General tips](#general-tips)
+- [Coding Guidelines](#coding-guidelines)
+- [Code of Conduct](#code-of-conduct)
+ - [Do not point fingers](#do-not-point-fingers)
+ - [Provide code feedback based on evidence](#provide-code-feedback-based-on-evidence)
+ - [Ask questions do not give answers](#ask-questions-do-not-give-answers)
## Steps to Contributing
@@ -33,15 +37,56 @@ Here are the basic steps to get started with your first contribution. Please rea
See the wiki for more details about our [merging strategy](https://github.com/microsoft/recommenders/wiki/Strategy-to-merge-the-code-to-main-branch).
+## Ideas for Contributions
+
+### A first contribution
+
+For people who are new to open source or to Recommenders, a good way to start is by contributing to the documentation. You can help with any of the README files or with the notebooks.
+
+For more advanced users, consider fixing one of the bugs listed in the issues.
+
+### Datasets
+
+To contribute new datasets, please consider this:
+
+* Minimize dependencies: it's better to use the `requests` library than a custom library.
+* Make sure that the dataset is publicly available and that the license allows for redistribution.
+
+### Models
+
+To contribute new models, please consider this:
+
+* Please don't add models that are already implemented in the repo. An exception to this rule is if you are adding a more efficient implementation or you want to migrate a model from TensorFlow to PyTorch.
+* Prioritize the minimal code necessary instead of adding a full library. If you add code from another repository, please make sure to follow the license and give proper credit.
+* All models should be accompanied by a notebook that shows how to use the model and how to train it. The notebook should be in the [examples](examples) folder.
+* The model should be tested with unit tests, and the notebooks should be tested with functional tests.
+
+### Metrics
+
+To contribute new metrics, please consider this:
+
+* A good way to contribute to metrics is to optimize the code of the existing ones.
+* If you are adding a new metric, please consider adding not only a CPU version, but also a PySpark version.
+* When adding the tests, make sure you check the limit cases. For example, if you add an error metric, check that the error between two identical datasets is zero.
+
+### General tips
+
+* Prioritize PyTorch over TensorFlow.
+* Minimize dependencies. Around 80% of the issues in the repo are related to dependencies.
+* Avoid adding code with GPL and other copyleft licenses. Prioritize MIT, Apache, and other permissive licenses.
+* Add the copyright statement at the beginning of the file: `Copyright (c) Recommenders contributors. Licensed under the MIT License.`
+
## Coding Guidelines
We strive to maintain high quality code to make the utilities in the repository easy to understand, use, and extend. We also work hard to maintain a friendly and constructive environment. We've found that having clear expectations on the development process and consistent style helps to ensure everyone can contribute and collaborate effectively.
-Please review the [coding guidelines](https://github.com/recommenders-team/recommenders/wiki/Coding-Guidelines) wiki page to see more details about the expectations for development approach and style.
+Please review the [Coding Guidelines](https://github.com/recommenders-team/recommenders/wiki/Coding-Guidelines) wiki page to see more details about the expectations for development approach and style.
+
+## Code of Conduct
Apart from the official [Code of Conduct](CODE_OF_CONDUCT.md), in Recommenders team we adopt the following behaviors, to ensure a great working environment:
-#### Do not point fingers
+### Do not point fingers
Let’s be constructive.
@@ -51,18 +96,18 @@ Let’s be constructive.
-#### Provide code feedback based on evidence
+### Provide code feedback based on evidence
When making code reviews, try to support your ideas based on evidence (papers, library documentation, stackoverflow, etc) rather than your personal preferences.
Click here to see some examples
-"When reviewing this code, I saw that the Python implementation the metrics are based on classes, however, [scikit-learn](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) and [tensorflow](https://www.tensorflow.org/api_docs/python/tf/metrics) use functions. We should follow the standard in the industry."
+"When reviewing this code, I saw that the Python implementation of the metrics is based on classes; however, [scikit-learn](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) uses functions. We should follow the standard in the industry."
-#### Ask questions do not give answers
+### Ask questions do not give answers
Try to be empathic.
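The tip in the Metrics section of the CONTRIBUTING.md changes above, about testing a new metric at its limits, can be illustrated with a minimal sketch. The `rmse` helper here is a hypothetical stand-in written with the standard library only, not the repository's implementation; it only shows the shape such a test might take: identical inputs give zero error, and a known constant offset gives a known value.

```python
import math


def rmse(y_true, y_pred):
    """Hypothetical root mean squared error over two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def test_rmse_limits():
    ratings = [1.0, 3.5, 4.0, 5.0]
    # Limit case: identical datasets must give an error of exactly zero.
    assert rmse(ratings, ratings) == 0.0
    # Known case: a constant offset of 1 gives an RMSE of 1.
    shifted = [r + 1.0 for r in ratings]
    assert math.isclose(rmse(ratings, shifted), 1.0)


test_rmse_limits()
```

A real test in the repository would presumably call the metric under test on DataFrames rather than lists; the structure of the assertions is the point here.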
diff --git a/README.md b/README.md
index 35f526d1a..74805200d 100644
--- a/README.md
+++ b/README.md
@@ -2,16 +2,22 @@
Copyright (c) Recommenders contributors.
Licensed under the MIT License.
-->
+
-# Recommenders
[![Documentation status](https://github.com/recommenders-team/recommenders/actions/workflows/pages/pages-build-deployment/badge.svg)](https://github.com/recommenders-team/recommenders/actions/workflows/pages/pages-build-deployment)
+[![License](https://img.shields.io/github/license/recommenders-team/recommenders.svg)](https://github.com/recommenders-team/recommenders/blob/main/LICENSE)
+[![Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+[![PyPI Version](https://img.shields.io/pypi/v/recommenders.svg?logo=pypi&logoColor=white)](https://pypi.org/project/recommenders)
+[![Python Versions](https://img.shields.io/pypi/pyversions/recommenders.svg?logo=python&logoColor=white)](https://pypi.org/project/recommenders)
-
+[](https://join.slack.com/t/lfaifoundation/shared_invite/zt-2iyl7zyya-g5rOO5K518CBoevyi28W6w)
+
+
## What's New (May, 2024)
-We have a new release [Recommenders 1.2.0](https://github.com/microsoft/recommenders/releases/tag/1.2.0)!
+We have a new release [Recommenders 1.2.0](https://github.com/recommenders-team/recommenders/releases/tag/1.2.0)!
So many changes since our last release. We have full tests on Python 3.8 to 3.11 (around 1800 tests), upgraded performance in many algorithms, reviewed notebooks, and many more improvements.
@@ -88,7 +94,7 @@ The table below lists the recommendation algorithms currently available in the r
| LightFM/Factorization Machine | Collaborative Filtering | Factorization Machine algorithm for both implicit and explicit feedbacks. It works in the CPU environment. | [Quick start](examples/02_model_collaborative_filtering/lightfm_deep_dive.ipynb) |
| LightGBM/Gradient Boosting Tree* | Content-Based Filtering | Gradient Boosting Tree algorithm for fast training and low memory usage in content-based problems. It works in the CPU/GPU/PySpark environments. | [Quick start in CPU](examples/00_quick_start/lightgbm_tinycriteo.ipynb) / [Deep dive in PySpark](examples/02_model_content_based_filtering/mmlspark_lightgbm_criteo.ipynb) |
| LightGCN | Collaborative Filtering | Deep learning algorithm which simplifies the design of GCN for predicting implicit feedback. It works in the CPU/GPU environment. | [Deep dive](examples/02_model_collaborative_filtering/lightgcn_deep_dive.ipynb) |
-| GeoIMC* | Collaborative Filtering | Matrix completion algorithm that has into account user and item features using Riemannian conjugate gradients optimization and following a geometric approach. It works in the CPU environment. | [Quick start](examples/00_quick_start/geoimc_movielens.ipynb) |
+| GeoIMC* | Collaborative Filtering | Matrix completion algorithm that takes into account user and item features using Riemannian conjugate gradient optimization and follows a geometric approach. It works in the CPU environment. | [Quick start](examples/00_quick_start/geoimc_movielens.ipynb) |
| GRU | Collaborative Filtering | Sequential-based algorithm that aims to capture both long and short-term user preferences using recurrent neural networks. It works in the CPU/GPU environment. | [Quick start](examples/00_quick_start/sequential_recsys_amazondataset.ipynb) |
| Multinomial VAE | Collaborative Filtering | Generative model for predicting user/item interactions. It works in the CPU/GPU environment. | [Deep dive](examples/02_model_collaborative_filtering/multi_vae_deep_dive.ipynb) |
| Neural Recommendation with Long- and Short-term User Representations (LSTUR)* | Content-Based Filtering | Neural recommendation algorithm for recommending news articles with long- and short-term user interest modeling. It works in the CPU/GPU environment. | [Quick start](examples/00_quick_start/lstur_MIND.ipynb) |
diff --git a/examples/00_quick_start/lightgbm_tinycriteo.ipynb b/examples/00_quick_start/lightgbm_tinycriteo.ipynb
index f7a786415..ffd827eac 100644
--- a/examples/00_quick_start/lightgbm_tinycriteo.ipynb
+++ b/examples/00_quick_start/lightgbm_tinycriteo.ipynb
@@ -717,7 +717,7 @@
"source": [
"test_preds = lgb_model.predict(test_x)\n",
"auc = roc_auc_score(np.asarray(test_y.reshape(-1)), np.asarray(test_preds))\n",
- "logloss = log_loss(np.asarray(test_y.reshape(-1)), np.asarray(test_preds), eps=1e-12)\n",
+ "logloss = log_loss(np.asarray(test_y.reshape(-1)), np.asarray(test_preds))\n",
"res_basic = {\"auc\": auc, \"logloss\": logloss}\n",
"print(res_basic)\n"
]
@@ -904,7 +904,7 @@
],
"source": [
"auc = roc_auc_score(np.asarray(test_y.reshape(-1)), np.asarray(test_preds))\n",
- "logloss = log_loss(np.asarray(test_y.reshape(-1)), np.asarray(test_preds), eps=1e-12)\n",
+ "logloss = log_loss(np.asarray(test_y.reshape(-1)), np.asarray(test_preds))\n",
"res_optim = {\"auc\": auc, \"logloss\": logloss}\n",
"\n",
"print(res_optim)"
@@ -959,7 +959,7 @@
],
"source": [
"auc = roc_auc_score(np.asarray(test_y.reshape(-1)), np.asarray(test_preds))\n",
- "logloss = log_loss(np.asarray(test_y.reshape(-1)), np.asarray(test_preds), eps=1e-12)\n",
+ "logloss = log_loss(np.asarray(test_y.reshape(-1)), np.asarray(test_preds))\n",
"\n",
"print({\"auc\": auc, \"logloss\": logloss})"
]
diff --git a/examples/02_model_collaborative_filtering/cornac_bivae_deep_dive.ipynb b/examples/02_model_collaborative_filtering/cornac_bivae_deep_dive.ipynb
index 731ab0c12..fb432ccca 100644
--- a/examples/02_model_collaborative_filtering/cornac_bivae_deep_dive.ipynb
+++ b/examples/02_model_collaborative_filtering/cornac_bivae_deep_dive.ipynb
@@ -610,7 +610,7 @@
"source": [
"## 4 Discussion\n",
"\n",
- "BiVAE is a new variational autoencoder tailored for dyadic data, where observations consist of measurements associated with two sets of objects, e.g., users, items and corresponding ratings. The model is symmetric, which makes it easier to extend axiliary data from both sides of users and items. In addition to preference data, the model can be applied to other types of dyadic data such as documentword matrices, and other tasks such as co-clustering. \n",
+ "BiVAE is a new variational autoencoder tailored for dyadic data, where observations consist of measurements associated with two sets of objects, e.g., users, items and corresponding ratings. The model is symmetric, which makes it easier to extend auxiliary data from both sides of users and items. In addition to preference data, the model can be applied to other types of dyadic data such as document-word matrices, and other tasks such as co-clustering. \n",
"\n",
"In the paper, there is also a discussion on Constrained Adaptive Priors (CAP), a proposed method to build informative priors to mitigate the well-known posterior collapse problem. We have left out that part purposely, not to distract the audiences. Nevertheless, it is very interesting and worth taking a look. \n",
"\n",
diff --git a/examples/02_model_collaborative_filtering/lightfm_deep_dive.ipynb b/examples/02_model_collaborative_filtering/lightfm_deep_dive.ipynb
index 8e588760f..5a60091d7 100755
--- a/examples/02_model_collaborative_filtering/lightfm_deep_dive.ipynb
+++ b/examples/02_model_collaborative_filtering/lightfm_deep_dive.ipynb
@@ -22,6 +22,8 @@
"source": [
"This notebook explains the concept of a Factorization Machine based model for recommendation, it also outlines the steps to construct a pure matrix factorization and a Factorization Machine using the [LightFM](https://github.com/lyst/lightfm) package. It also demonstrates how to extract both user and item affinity from a fitted model.\n",
"\n",
+ "*NOTE: LightFM is not available in the core package of Recommenders. To run this notebook, install the experimental package with `pip install recommenders[experimental]`.*\n",
+ "\n",
"## 1. Factorization Machine model\n",
"\n",
"### 1.1 Background\n",
diff --git a/pyproject.toml b/pyproject.toml
index 18574aba5..b65d7f03c 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -2,12 +2,12 @@
requires = [
"setuptools>=67",
"wheel>=0.36",
- "numpy>=1.15",
+ "numpy>=1.15,<2",
]
dependencies = [
"setuptools>=67",
"wheel>=0.36",
- "numpy>=1.15",
+ "numpy>=1.15,<2",
]
build-backend = "setuptools.build_meta"
diff --git a/recommenders/datasets/movielens.py b/recommenders/datasets/movielens.py
index c0b3b5f72..a8a8b4441 100644
--- a/recommenders/datasets/movielens.py
+++ b/recommenders/datasets/movielens.py
@@ -582,7 +582,7 @@ def unique_columns(df, *, columns):
return not df[columns].duplicated().any()
-class MockMovielensSchema(pa.SchemaModel):
+class MockMovielensSchema(pa.DataFrameModel):
"""
Mock dataset schema to generate fake data for testing purpose.
This schema is configured to mimic the Movielens dataset
diff --git a/recommenders/datasets/pandas_df_utils.py b/recommenders/datasets/pandas_df_utils.py
index 50bd83dd8..f9711ce24 100644
--- a/recommenders/datasets/pandas_df_utils.py
+++ b/recommenders/datasets/pandas_df_utils.py
@@ -163,7 +163,7 @@ def fit(self, df, col_rating=DEFAULT_RATING_COL):
types = df.dtypes
if not all(
[
- x == object or np.issubdtype(x, np.integer) or x == np.float
+ x == object or np.issubdtype(x, np.integer) or x == float
for x in types
]
):
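For context on the `np.float` change above: `np.float` was a deprecated alias for the builtin `float` and was removed in NumPy 1.24, so comparing against `float` keeps the same behavior. The sketch below mirrors the dtype check from the hunk in a standalone helper; the name `is_supported_dtype` is an assumption for illustration and does not exist in the repository.

```python
import numpy as np


def is_supported_dtype(dtype):
    """Mirror the check in the hunk: object, any integer type, or builtin float."""
    return dtype == object or np.issubdtype(dtype, np.integer) or dtype == float


# Evaluate the check for a few common dtypes.
checks = {
    name: bool(is_supported_dtype(np.dtype(name)))
    for name in ("object", "int32", "int64", "float64", "float32", "bool")
}
print(checks)
```

With this check, `float64` passes because NumPy maps the builtin `float` to `float64`, while `float32` and `bool` columns do not pass, exactly as with the old `np.float` comparison.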
diff --git a/recommenders/evaluation/python_evaluation.py b/recommenders/evaluation/python_evaluation.py
index dff164ab4..9c7b9115f 100644
--- a/recommenders/evaluation/python_evaluation.py
+++ b/recommenders/evaluation/python_evaluation.py
@@ -435,9 +435,9 @@ def merge_ranking_true_pred(
# count the number of hits vs actual relevant items per user
df_hit_count = pd.merge(
- df_hit.groupby(col_user, as_index=False)[col_user].agg({"hit": "count"}),
+ df_hit.groupby(col_user, as_index=False)[col_user].agg(hit="count"),
rating_true_common.groupby(col_user, as_index=False)[col_user].agg(
- {"actual": "count"}
+ actual="count",
),
on=col_user,
)
diff --git a/recommenders/models/deeprec/DataModel/ImplicitCF.py b/recommenders/models/deeprec/DataModel/ImplicitCF.py
index f490c48f3..3cfbb2821 100644
--- a/recommenders/models/deeprec/DataModel/ImplicitCF.py
+++ b/recommenders/models/deeprec/DataModel/ImplicitCF.py
@@ -80,6 +80,7 @@ def _data_processing(self, train, test):
user_idx = df[[self.col_user]].drop_duplicates().reindex()
user_idx[self.col_user + "_idx"] = np.arange(len(user_idx))
self.n_users = len(user_idx)
+ self.n_users_in_train = train[self.col_user].nunique()
self.user_idx = user_idx
self.user2id = dict(
@@ -210,7 +211,7 @@ def sample_neg(x):
if neg_id not in x:
return neg_id
- indices = range(self.n_users)
+ indices = range(self.n_users_in_train)
if self.n_users < batch_size:
users = [random.choice(indices) for _ in range(batch_size)]
else:
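The reason for sampling from `range(self.n_users_in_train)` can be sketched with plain Python. This is a simplified illustration, not the `ImplicitCF` code: it assumes user ids are assigned consecutively in order of first appearance with training rows first, so users that occur only in the test set receive the highest ids. The helper `build_user_index` and the sample user names are hypothetical.

```python
import random


def build_user_index(train_users, test_users):
    """Assign consecutive ids in order of first appearance, train rows first."""
    user2id = {}
    for user in list(train_users) + list(test_users):
        user2id.setdefault(user, len(user2id))
    return user2id


train_users = ["u1", "u2", "u1", "u3"]
test_users = ["u2", "u4", "u5"]  # u4 and u5 never appear in train

user2id = build_user_index(train_users, test_users)
n_users = len(user2id)                    # all users seen anywhere
n_users_in_train = len(set(train_users))  # users with training rows

# Hypothetical per-user training interactions, keyed by user id.
train_items = {user2id[u]: [] for u in train_users}

rng = random.Random(0)
sampled = [rng.choice(range(n_users_in_train)) for _ in range(100)]
# Every sampled id belongs to a user with training interactions.
assert all(uid in train_items for uid in sampled)
# Sampling from range(n_users) could instead return the test-only ids.
assert set(range(n_users)) - set(train_items) == {3, 4}
```

Under that ordering assumption, restricting the range to the training users means a sampled id always has training interactions to draw positives from.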
diff --git a/setup.py b/setup.py
index a62eda445..bc72fec95 100644
--- a/setup.py
+++ b/setup.py
@@ -30,13 +30,11 @@
"category-encoders>=2.6.0,<3", # requires packaging
"cornac>=1.15.2,<3", # requires packaging, tqdm
"hyperopt>=0.2.7,<1",
- # TODO: Wait for the PR to be merged and released
- "lightfm@git+https://github.com/daviddavo/lightfm", # requires requests
"lightgbm>=4.0.0,<5",
"locust>=2.12.2,<3", # requires jinja2
"memory-profiler>=0.61.0,<1",
"nltk>=3.8.1,<4", # requires tqdm
- "notebook>=7.0.0,<8", # requires ipykernel, jinja2, jupyter, nbconvert, nbformat, packaging, requests
+ "notebook>=6.5.5,<8", # requires ipykernel, jinja2, jupyter, nbconvert, nbformat, packaging, requests
"numba>=0.57.0,<1",
"pandas>2.0.0,<3.0.0", # requires numpy
"pandera[strategies]>=0.6.5,<0.18;python_version<='3.8'", # For generating fake datasets
@@ -81,6 +79,7 @@
# nni needs to be upgraded
"nni==1.5",
"pymanopt>=0.2.5",
+ "lightfm>=1.17,<2",
]
# The following dependency can be installed as below, however PyPI does not allow direct URLs.
diff --git a/tests/ci/azureml_tests/test_groups.py b/tests/ci/azureml_tests/test_groups.py
index f05e27a9f..2a262c12d 100644
--- a/tests/ci/azureml_tests/test_groups.py
+++ b/tests/ci/azureml_tests/test_groups.py
@@ -47,8 +47,6 @@
"tests/functional/examples/test_notebooks_python.py::test_geoimc_functional", # 1006.19s
#
"tests/functional/examples/test_notebooks_python.py::test_benchmark_movielens_cpu", # 58s
- #
- "tests/functional/examples/test_notebooks_python.py::test_lightfm_functional",
],
"group_cpu_003": [ # Total group time: 2253s
"tests/data_validation/recommenders/datasets/test_criteo.py::test_download_criteo_sample", # 1.05s
@@ -237,10 +235,6 @@
"tests/unit/recommenders/models/test_geoimc.py::test_imcproblem",
"tests/unit/recommenders/models/test_geoimc.py::test_inferer_init",
"tests/unit/recommenders/models/test_geoimc.py::test_inferer_infer",
- "tests/unit/recommenders/models/test_lightfm_utils.py::test_interactions",
- "tests/unit/recommenders/models/test_lightfm_utils.py::test_fitting",
- "tests/unit/recommenders/models/test_lightfm_utils.py::test_sim_users",
- "tests/unit/recommenders/models/test_lightfm_utils.py::test_sim_items",
"tests/unit/recommenders/models/test_sar_singlenode.py::test_init",
"tests/unit/recommenders/models/test_sar_singlenode.py::test_fit",
"tests/unit/recommenders/models/test_sar_singlenode.py::test_predict",
@@ -453,3 +447,14 @@
"tests/unit/examples/test_notebooks_gpu.py::test_gpu_vm",
],
}
+
+# Experimental test groups are additional groups that require installing extra dependencies: pip install .[experimental]
+experimental_test_groups = {
+ "group_cpu_001": [
+ "tests/unit/recommenders/models/test_lightfm_utils.py::test_interactions",
+ "tests/unit/recommenders/models/test_lightfm_utils.py::test_fitting",
+ "tests/unit/recommenders/models/test_lightfm_utils.py::test_sim_users",
+ "tests/unit/recommenders/models/test_lightfm_utils.py::test_sim_items",
+ "tests/functional/examples/test_notebooks_python.py::test_lightfm_functional",
+ ]
+}