
MoEBERT code reading #31

Open · wants to merge 3 commits into main
Conversation

long8v (Owner) commented May 23, 2022

long8v changed the title from "add moe bert" to "MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation" on May 23, 2022
Comment on lines +23 to +25
def _random_hash_list(self, vocab_size):
hash_list = torch.randint(low=0, high=self.num_experts, size=(vocab_size,))
return hash_list

Routing each token to a random expert works well on its own; see the Hash Layers paper:
https://arxiv.org/pdf/2106.04426.pdf
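
A minimal sketch of how a fixed random hash table like this can route tokens, assuming routing is done by vocabulary id (toy sizes; the input_ids/expert_ids names are illustrative, not from the repo):

import torch

num_experts, vocab_size = 4, 10
hash_list = torch.randint(low=0, high=num_experts, size=(vocab_size,))
input_ids = torch.tensor([3, 7, 3, 1])  # token ids in a sequence
expert_ids = hash_list[input_ids]       # same token id -> same expert, every time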

Comment on lines +33 to +34
def _forward_gate_token(self, x):
bsz, seq_len, dim = x.size()

The forward pass through the gate.
Since this is BERT, the input shape is (bsz, seq_len, hid_dim).

Comment on lines +36 to +39
x = x.view(-1, dim)
logits_gate = self.gate(x)
prob_gate = F.softmax(logits_gate, dim=-1)
gate = torch.argmax(prob_gate, dim=-1)

Reshape to (bsz * seq_len, hid_dim) and pass through the gate.
logits_gate then has shape (bsz * seq_len, num_experts), and softmax is taken over the last dimension.
The index of the maximum probability is taken (= gate). gate has shape (bsz * seq_len,), where each value is the index of the expert with the highest probability.
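
A toy shape walk-through of this gating step (sizes assumed for illustration; gate_layer stands in for self.gate):

import torch
import torch.nn as nn
import torch.nn.functional as F

bsz, seq_len, dim, num_experts = 2, 3, 8, 4
gate_layer = nn.Linear(dim, num_experts)
x = torch.randn(bsz, seq_len, dim)

x = x.view(-1, dim)                       # (6, 8) = (bsz*seq_len, dim)
logits_gate = gate_layer(x)               # (6, 4) = (bsz*seq_len, num_experts)
prob_gate = F.softmax(logits_gate, dim=-1)
gate = torch.argmax(prob_gate, dim=-1)    # (6,)   expert index per token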

Comment on lines +41 to +43
order = gate.argsort(0)
num_tokens = F.one_hot(gate, self.num_experts).gt(0).sum(0)
gate_load = num_tokens.clone()

argsort sorts gate along dim 0 in ascending order, so the resulting indices group tokens by expert index.
For example, gate = [2, 0, 1, 0] means token 0 picks expert 2, tokens 1 and 3 pick expert 0, and token 2 picks expert 1;
order = gate.argsort(0) = [1, 3, 2, 0], i.e. expert-0 tokens come first.
num_tokens counts how many tokens are assigned to each expert.
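
A concrete check of the example above (toy gate values):

import torch
import torch.nn.functional as F

num_experts = 3
gate = torch.tensor([2, 0, 1, 0])
order = gate.argsort(0)                                 # tensor([1, 3, 2, 0])
num_tokens = F.one_hot(gate, num_experts).gt(0).sum(0)  # tensor([2, 1, 1])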

Comment on lines +44 to +45
x = x[order] # reorder according to expert number
x = x.split(num_tokens.tolist(), dim=0) # a list of length self.num_experts

x has shape (bsz * seq_len, hid_dim) and order has shape (bsz * seq_len,).
Indexing x[order] reorders the token rows by expert index, so tokens assigned to the same expert become contiguous.
split then divides x according to the per-expert token counts,
yielding a tuple (tokens for expert 0, tokens for expert 1, ...).
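
Continuing the same toy example (4 tokens, dim 8, 3 experts):

import torch

x = torch.randn(4, 8)                       # (bsz*seq_len, dim)
order = torch.tensor([1, 3, 2, 0])
num_tokens = [2, 1, 1]
x = x[order]                                # expert-0 rows first, then expert 1, 2
chunks = x.split(num_tokens, dim=0)         # shapes (2, 8), (1, 8), (1, 8)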

Comment on lines +48 to +51
P = prob_gate.mean(0)
temp = num_tokens.float()
f = temp / temp.sum(0, keepdim=True)
balance_loss = self.num_experts * torch.sum(P * f)

The load balancing loss.
This matches the auxiliary load-balancing loss from the Switch Transformers paper (https://arxiv.org/abs/2101.03961): loss = N * sum_i(f_i * P_i), where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability for expert i.
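
A quick sanity check with assumed toy values: under perfectly uniform routing, f_i = P_i = 1/N for every expert, so the loss reaches its minimum value of 1.

import torch

num_experts = 4
P = torch.full((num_experts,), 1 / num_experts)  # mean router probability per expert
f = torch.full((num_experts,), 1 / num_experts)  # fraction of tokens per expert
balance_loss = num_experts * torch.sum(P * f)    # tensor(1.) at perfect balance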

Comment on lines +53 to +55
prob_gate = prob_gate.gather(dim=1, index=gate.unsqueeze(1))
prob_gate = prob_gate[order]
prob_gate = prob_gate.split(num_tokens.tolist(), dim=0)

prob_gate has shape (bsz * seq_len, num_experts); gate (bsz * seq_len,) is unsqueezed to (bsz * seq_len, 1) and used as the index for gather, i.e. this picks out only the probability of each token's selected (argmax) expert. After the gather, prob_gate has shape (bsz * seq_len, 1).
Just as x was reordered and split above, the per-token gate probabilities are also sorted by expert index and split per expert.
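
A toy example of the gather step (values assumed for illustration):

import torch

prob_gate = torch.tensor([[0.1, 0.2, 0.7],
                          [0.6, 0.3, 0.1]])                # (2 tokens, 3 experts)
gate = prob_gate.argmax(dim=-1)                            # tensor([2, 0])
picked = prob_gate.gather(dim=1, index=gate.unsqueeze(1))  # tensor([[0.7], [0.6]])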

Comment on lines +57 to +60
def forward_expert(input_x, prob_x, expert_idx):
input_x = self.experts[expert_idx].forward(input_x)
input_x = input_x * prob_x
return input_x

Given input_x, prob_x, and expert_idx, forward input_x through that expert and scale the output by the gate probability.

Comment on lines +62 to +67
x = [forward_expert(x[i], prob_gate[i], i) for i in range(self.num_experts)]
x = torch.vstack(x)
x = x[order.argsort(0)] # restore original order
x = x.view(bsz, seq_len, dim)

return x, balance_loss, gate_load

x is already a tuple split per expert.
For each expert, the inputs routed to it are passed to forward_expert above.
The result is a list, and vstack stacks it into a (bsz * seq_len, dim) tensor, still ordered by expert index.
Indexing with order.argsort(0) restores the original token order: the argsort of a permutation is its inverse permutation, so it undoes the earlier reordering.
view then reshapes back to the original (bsz, seq_len, dim).
The final return values of forward are x (bsz, seq_len, dim), balance_loss, and gate_load (the number of tokens each expert handles).
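
A toy check that order.argsort(0) really is the inverse permutation:

import torch

order = torch.tensor([1, 3, 2, 0])    # permutation used to group tokens by expert
inv = order.argsort(0)                # tensor([3, 0, 2, 1]), the inverse permutation
x = torch.tensor([10, 11, 12, 13])
assert torch.equal(x[order][inv], x)  # round trip restores the original order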

long8v changed the title from "MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation" to "MoEBERT code reading" on Jul 21, 2022