Double check formula for adding lift_data to input signal data #130

Open
bobleesj opened this issue Nov 1, 2024 · 14 comments

@bobleesj (Contributor) commented Nov 1, 2024

Problem

lift_data is applied to input signals below:

def main():
    args = create_parser()
    # Load input data, prepare tuple((1850, 1) and (1850, 50))
    grid, input_data = load_input_signals(args.input_directory)

    # Apply lifting factor to input signal data
    lifted_input_data = lift_data(input_data, args.lift_factor)
    ...

The following is the implementation of lift_data:

def lift_data(data_input: np.ndarray, lift=1):
    """Lifts values of data_input.

    Adds 'lift' * the minimum value in data_input to data_input element-wise.

    Parameters
    ----------
    data_input: 2d array like
      The matrix containing a series of signals to be decomposed. Has dimensions N x M where N is the length
      of each signal and M is the number of signals.

    lift: float
      The factor representing how much to lift 'data_input'.

    Returns
    -------
    2d array like
      The matrix that contains data_input - (min(data_input) * lift).
    """
    return data_input + np.abs(np.min(data_input) * lift)

While the docstring says the function returns data_input - (min(data_input) * lift), the actual code is written as return data_input + np.abs(np.min(data_input) * lift).
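To make the difference concrete, here is a minimal sketch with made-up values. The two expressions agree when the global minimum is negative and diverge when it is positive:

import numpy as np

lift = 1

# Case 1: negative global minimum -- the two formulas agree,
# because abs(min * lift) == -min * lift when min < 0.
neg = np.array([[-2.0, 1.0], [3.0, 4.0]])
neg + np.abs(np.min(neg) * lift)  # [[0, 3], [5, 6]]
neg - np.min(neg) * lift          # [[0, 3], [5, 6]] -- identical

# Case 2: positive global minimum -- they diverge.
pos = np.array([[2.0, 3.0], [4.0, 5.0]])
pos + np.abs(np.min(pos) * lift)  # [[4, 5], [6, 7]] -- lifted further up
pos - np.min(pos) * lift          # [[0, 1], [2, 3]] -- minimum mapped to 0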

Proposed solution

Double-check the literature (if there is any)

@Fil158 commented Nov 1, 2024

Ok, I'll do it and reply to you

@sbillinge (Contributor)

Also, since this is an additive correction and not a multiplicative one, let's not call it a factor. We could use "term", but it doesn't seem so clear in this context; or maybe "quantity", "amount", or "value"? So lift_amount or lift_value. I leave the decision up to you guys.

@bobleesj thanks for catching that. Was it not caught by a test though? That is not good.....it probably means that the tests were written by running the function, getting the result, then coding that into the test! Ugh! Or maybe it is a case where we didn't have a test. Not sure, I didn't look. In general folks, tests first, code after.....

@Fil158 commented Nov 6, 2024

I searched the literature a bit: I couldn't find anything identical, but I did find this method (SNV, standard normal variate), which serves about the same purpose as ours and could be useful because it normalises spectra (it is mainly used in spectroscopy); for us it would be patterns.

import numpy as np

def snv(data):
    # Vectorized approach to SNV (standard normal variate)
    mean = np.mean(data, axis=1, keepdims=True)  # mean of each row
    std = np.std(data, axis=1, keepdims=True)    # std of each row
    return (data - mean) / std

# Usage
normalized = snv(spectra)  # spectra is a 2D array where each row is a spectrum

Other methods are for background subtraction but this is not the case here.

I think lift_value is the best name for the term.

@bobleesj (Contributor, Author) commented Nov 6, 2024

Was it not caught by a test though? That is not good.....it probably means that the tests were written by running the function, getting the result, then coding that into the test! Ugh!

I am afraid that's the case, especially given that none of the test functions use real data: all of the test inputs are dummy matrices, hard-coded.
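As an illustration (a sketch only, not the project's actual test suite; the stand-in function here just encodes the docstring formula, where the real test would import the function from the package under test), a test-first version would pin the expected output to values worked out by hand:

import numpy as np

def lift_data(data_input, lift=1):
    # Stand-in encoding the docstring formula; a real test would import
    # the function from the package under test instead.
    return data_input - np.min(data_input) * lift

def test_lift_data_minimum_maps_to_zero():
    data = np.array([[1.0, 2.0], [3.0, 4.0]])
    # Expected values worked out by hand, not by running the function first.
    expected = np.array([[0.0, 1.0], [2.0, 3.0]])
    np.testing.assert_allclose(lift_data(data, lift=1), expected)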

@bobleesj (Contributor, Author) commented Nov 6, 2024

I searched the literature a bit: I couldn't find anything identical, but I did find this method (SNV) [...] I think lift_value is the best name for the term.

Sorry, I am a bit confused. We don't want to include new functions/features, since the goal is to design our code in a way that reproduces the plots and figures from matlab. The question is: do we do data_input + np.abs(np.min(data_input) * lift) or data_input - np.abs(np.min(data_input) * lift)? Maybe the matlab code has it? @Fil158

@sbillinge (Contributor)

I searched the literature a bit: I couldn't find anything identical, but I did find this method (SNV) [...] I think lift_value is the best name for the term.

Sorry, I am a bit confused. We don't want to include new functions/features [...] Maybe the matlab code has it? @Fil158

Yes, sorry for the confusion. This is not a research/literature issue; it is checking what Ran did and making sure that we coded it up correctly.

@Fil158 commented Nov 7, 2024

%% remove baseline (lift)
data_input = Data - min(Data(:));

This is the code in matlab that removes the baseline by subtracting the minimum value of the entire data matrix from each element. It returns all values shifted such that the minimum value is 0.
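For reference, a direct NumPy translation of that line would be something like this (a sketch; the array values are illustrative):

import numpy as np

Data = np.array([[-1.0, 2.0], [0.5, 3.0]])  # illustrative values

# NumPy equivalent of the MATLAB Data - min(Data(:)):
# np.min flattens by default, so the global minimum maps to exactly 0.
data_input = Data - np.min(Data)
print(np.min(data_input))  # 0.0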

@Fil158 commented Nov 7, 2024

So it seems to be closer to data_input - np.abs(np.min(data_input) * lift), but I'm afraid that subtraction could potentially cause more problems, so maybe data_input + np.abs(np.min(data_input) * lift) is better.

@bobleesj (Contributor, Author) commented Nov 7, 2024

but I'm afraid that subtraction could potentially cause more problems

If we consider the matlab code as our source of truth, what other problems would it have?

@Fil158 commented Nov 7, 2024

I can think of a situation where you have 0s in the matrix and, for some reason, the code subtracts a minimum value calculated only on the strictly positive numbers, so you get negative values. It's probably my mindless paranoia....

@bobleesj (Contributor, Author) commented Nov 7, 2024

@Fil158 Probably not? Say, 1 - 1 * 0.1 is still positive, assuming the data input is positive and the lift factor is less than 1 (you could double-check this for me).
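A quick numeric check of that claim (illustrative values only):

import numpy as np

data = np.array([[1.0, 2.0], [3.0, 4.0]])  # all positive, min = 1
lift = 0.1

lifted = data - np.min(data) * lift  # subtracts 1 * 0.1 = 0.1 everywhere
print(np.min(lifted))  # 0.9 -- still positive: min * (1 - lift) > 0 whenever lift < 1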

@Fil158 commented Nov 7, 2024

As I understand it, non-negative values are required for NMF, not necessarily strictly positive or > 1, so I think it could work.

@bobleesj (Contributor, Author) commented Nov 7, 2024

OK, could you correct the function? It will probably break tests, as expected.

Also, have you looked into the overall algorithm implemented in matlab and described in the paper? Wondering about the current status now.

@sbillinge (Contributor)

Let's code to the matlab. I don't see a reason why we don't just subtract the minimum. Other considerations are (a) code readability: I think subtracting the minimum is clearest here; and (b) code performance: probably not an issue, but it is usually a bad idea to take absolute values, especially in regression problems, because they are not differentiable in general and introduce instabilities when using gradient methods. Ran is a professional mathematician who is expert in regression problems, and makes some choices that are obvious to him without telling us why....in general we should go with what he did unless we have a very good reason to change.
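Following that reasoning, one possible correction consistent with the matlab snippet above would be something like this (a sketch, not a merged fix):

import numpy as np

def lift_data(data_input: np.ndarray, lift=1):
    # Keeps the existing lift parameter; for lift=1 this mirrors the MATLAB
    # Data - min(Data(:)) baseline removal, with no absolute value so the
    # shift is linear in the data.
    return data_input - np.min(data_input) * lift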
