feat: Add random state feature. #150

Open · wants to merge 6 commits into base: john-development
25 changes: 18 additions & 7 deletions src/diffpy/snmf/snmf_class.py
@@ -4,8 +4,20 @@


class SNMFOptimizer:
-    def __init__(self, MM, Y0=None, X0=None, A=None, rho=1e12, eta=610, max_iter=500, tol=5e-7, components=None):
-        print("Initializing SNMF Optimizer")
+    def __init__(
+        self,
+        MM,
+        Y0=None,
+        X0=None,
+        A=None,
Contributor:
more descriptive name?

Author:
There are many different standards for what to name these matrices. Zero agreement between sources that use NMF. I'm inclined to eventually use what sklearn.decomposition.non_negative_factorization uses, which would mean MM->X, X->W, Y->H. But I'd like to leave this as is for the moment until there's a consensus about what would be the most clear or standard. If people will be finding this tool from the sNMF paper, there's also an argument for using the X, Y, and A names because that was used there.

Contributor:
OK, sounds good. It has to be a very good reason to break PEP8. The only good enough reason I can think of is to be consistent with scikit-learn. Another way of saying it is that we can "adopt the scikit-learn standard".

Author:
I'm fine with adopting the scikit-learn standard, but I also like the idea of giving them descriptive (and lowercase) names. The argument against that is that some lines of code use A/X/Y 10+ times in quick succession, so it would make the code very verbose.

Contributor:
More readable code is always better, so lower-case descriptive is preferred by me. I don't actually like that scikit-learn breaks this. Shall we go with lower-case? Names can be short if they are defined in the docstring and docs too. The code benefits from being readable, so I would say use your judgement on that.

Author:
I've started the conversion to lower case, but it's a large enough process (involving many poorly labeled sub-variables of the uppercase ones) that it feels like it should be its own separate PR. Does that make sense?
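The naming correspondence discussed above can be sketched with plain NumPy. The shapes follow the ones used in this class (MM is N x M, X is N x K, Y is K x M); the mapping to scikit-learn's names is the one proposed in the comment (MM->X, X->W, Y->H), and the concrete dimensions here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 6, 2, 5  # illustrative dimensions

# sNMF-paper naming used in this PR: MM (N x M) factors as X (N x K) @ Y (K x M)
X = rng.random((N, K))
Y = rng.random((K, M))
MM = X @ Y

# scikit-learn naming for the same objects: the data matrix is called X
# and it factors as W @ H, hence the proposed mapping MM->X, X->W, Y->H
W, H = X, Y
assert MM.shape == (N, M)
assert np.allclose(MM, W @ H)
```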

+        rho=1e12,
+        eta=610,
+        max_iter=500,
+        tol=5e-7,
+        components=None,
+        random_state=None,
+    ):
+
         self.MM = MM
         self.X0 = X0
         self.Y0 = Y0
@@ -15,23 +27,22 @@ def __init__(self, MM, Y0=None, X0=None, A=None, rho=1e12, eta=610, max_iter=500
         # Capture matrix dimensions
         self.N, self.M = MM.shape
         self.num_updates = 0
+        self.rng = np.random.default_rng(random_state)
Contributor:
can we have a more descriptive variable name? Is this a range? What is the range?

Contributor:
ping on this one.

Author:
self.rng is not a numerical range. It's an instance of NumPy's default_rng() (introduced in NumPy 1.17), which is used to generate reproducible pseudo-random numbers when seeded with an integer from random_state. The actual range is chosen when you use the rng object in the code. We could change it, but both rng and random_state are standard names for these, including in scikit-learn.

Contributor:
OK, let's keep the name and we can say how it is used in the docstring. Something like "The value used to initialize the random state in ..." where .... differentiates it from the docstring for random_state which is presumably the same but used somewhere different?

Contributor:
Ah, looking at the code, I see that it is generated from random_state. If this is never accessed by the user, just passed around internally, then we could make it private. We do that by giving it a leading underscore, so self._rng, and then we don't have to write a docstring for it. Users never access it, but it is available to internal functions.
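As a quick illustration of the reproducibility point made in this thread (a sketch, not part of the diff): two generators built from the same random_state value yield identical draws, which is what makes seeded runs repeatable.

```python
import numpy as np

# Two generators seeded with the same random_state produce identical draws;
# passing None (the default) seeds from OS entropy, so runs then differ.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

draws_a = rng_a.beta(a=2.5, b=1.5, size=(3, 4))
draws_b = rng_b.beta(a=2.5, b=1.5, size=(3, 4))
assert np.array_equal(draws_a, draws_b)
```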


         if Y0 is None:
             if components is None:
                 raise ValueError("Must provide either Y0 or a number of components.")
             else:
                 self.K = components
-                self.Y0 = np.random.beta(a=2.5, b=1.5, size=(self.K, self.M))  # This is untested
+                self.Y0 = self.rng.beta(a=2.5, b=1.5, size=(self.K, self.M))
         else:
             self.K = Y0.shape[0]

         # Initialize A, X0 if not provided
         if self.A is None:
-            self.A = np.ones((self.K, self.M)) + np.random.randn(self.K, self.M) * 1e-3  # Small perturbation
+            self.A = np.ones((self.K, self.M)) + self.rng.normal(0, 1e-3, size=(self.K, self.M))
Contributor:
K and M are probably good names if the matrix decomposition equation is in the docstring, so they get defined there.

Contributor:
I think you addressed this with your comment on MM, but as a general rule, please respond to each comment so the reviewer knows you have seen it. A simple thumbs-up works if you have seen a comment and agree; it saves time in the long run, so I don't have to write this long comment... :)

Author:
Got it. I'd like to put the matrix decomposition in the docstring, but I'm having trouble formatting it. Might have to ask about this in one of the meetings.

Contributor:
yes, I am not 100% sure but I think there is a way.

         if self.X0 is None:
-            self.X0 = np.random.rand(self.N, self.K)  # Ensures values in [0,1]
+            self.X0 = self.rng.random((self.N, self.K))

         # Initialize solution matrices to be iterated on
         self.X = np.maximum(0, self.X0)
         self.Y = np.maximum(0, self.Y0)
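The seeded initialization pattern in this diff can be sketched as a standalone function (the helper name `init_matrices` is hypothetical, not part of the PR); calling it twice with the same random_state reproduces the same starting matrices:

```python
import numpy as np

def init_matrices(N, M, K, random_state=None):
    # Hypothetical standalone helper mirroring the seeded initialization
    # pattern introduced in this diff (not actual diffpy.snmf API).
    rng = np.random.default_rng(random_state)
    Y0 = rng.beta(a=2.5, b=1.5, size=(K, M))
    A = np.ones((K, M)) + rng.normal(0, 1e-3, size=(K, M))  # small perturbation
    X0 = rng.random((N, K))  # values in [0, 1)
    return np.maximum(0, X0), np.maximum(0, Y0), A

# Same seed -> identical starting matrices, which is the point of random_state
X1, Y1, A1 = init_matrices(10, 8, 3, random_state=7)
X2, Y2, A2 = init_matrices(10, 8, 3, random_state=7)
assert all(np.array_equal(p, q) for p, q in zip((X1, Y1, A1), (X2, Y2, A2)))
```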
