boostcampaitech7/level2-mrc-nlp-08


🔥 Naver AI Tech NLP Team 8: The AIluminator 🌟

Level 2 Project - Open-Domain Question Answering

Table of Contents

  1. Project Overview
  2. Installation and Quick Start
  3. Project Workflow
  4. Leaderboard Results

1. Project Overview

(1) Topic and Goal

  • Boostcamp AI Tech NLP track, level 2 MRC
  • Topic : ODQA (Open-Domain Question Answering)
    Use the ODQA dataset to predict the correct answer to a given question

(2) Evaluation Metrics

  • Primary metric : Exact Match (a point is awarded only when the model's prediction matches the ground-truth answer exactly)
  • For reference : F1 score (partial credit is awarded when the model's prediction overlaps with the ground-truth answer)
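
To make the difference between the two metrics concrete, here is a minimal sketch. It assumes plain whitespace tokenization and no answer normalization, which is a simplification; the competition's scoring script may normalize or tokenize answers differently. The function names `exact_match` and `f1_score` are hypothetical.

```python
from collections import Counter


def exact_match(prediction: str, answer: str) -> float:
    """Return 1.0 only when the two strings are identical (ignoring outer whitespace)."""
    return float(prediction.strip() == answer.strip())


def f1_score(prediction: str, answer: str) -> float:
    """Token-level F1: partial credit for overlapping tokens (whitespace tokens assumed)."""
    pred_tokens = prediction.split()
    ans_tokens = answer.split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ans_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ans_tokens)
    return 2 * precision * recall / (precision + recall)
```

With these definitions, a prediction of "Seoul City" against the answer "Seoul" scores 0 on Exact Match but still earns partial F1 credit, because one of its two tokens overlaps.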

(3) Development Environment

  • GPU : Tesla V100 * 4

(4) Collaboration Environment

Tool     Description
GitHub   - Create an issue for each task
         - Create a branch for each assigned issue, then open a PR and merge into main
Slack    - Integrated with GitHub to follow repository updates in real time
         - Meetings and result sharing via huddles
Notion   - Timeline notes
         - Task management with a kanban board
Zoom     - Progress sharing
WandB    - Hyperparameter optimization with Sweep

(5) Team Members

Member   Team          Role
김동한   Data, Model   - Extraction Reader Modeling (training and inference)
                       - Extraction Reader architecture modification (CNN Head)
                       - Sparse Passage Retrieval (analysis of retrieval results)
                       - EDA (analysis of the token count distribution)
김성훈   Data, Model   - Code Modularization
                       - Sparse/Dense Passage Retrieval (implementation and experiments)
                       - Generation Reader Modeling (LLM training and experiments)
                       - ML Pipeline
김수아   Model         - Question augmentation (KoBART)
                       - Experimentation (top-k)
김현욱   Data, Model   - Generation Reader Modeling (training and inference)
                       - EDA (analysis of text quality)
송수빈   Model         - Extraction Reader Modeling (training and inference)
                       - Experimentation (compiling the list of experiment models and summarizing results)
                       - Logging & HyperParameter Tuning (WandB Sweep)
                       - Ensemble (ensemble code, correlation analysis code for model selection)
신수환   Data, Model   - Sparse Passage Retrieval (BM25 performance improvement)
                       - Data preprocessing (Data Cleaning)

2. Installation and Quick Start

Step 1. All dependencies required for the project are listed in requirements.txt; create a virtual environment for them and run the project inside it

# Create a virtual environment
$ python -m venv .venv

# Activate the virtual environment
$ . .venv/bin/activate

# Optional, depending on the server environment provided
$ export TMPDIR=/data/ephemeral/tmp
$ mkdir -p $TMPDIR

# Install the required libraries
$ pip install --upgrade pip
$ pip install -r requirements.txt

Step 2. Run Pre-Processing

# Change the working directory
$ cd pre_process

# Follow this Jupyter notebook to augment the KorQuAD 1.0 data
$ data_augment_korquadv1.ipynb

# Follow this Jupyter notebook to augment the AIHub data
$ data_augment_aihub.ipynb

# Follow this Jupyter notebook to build the data for training DPR retrieval
$ generate_DPR_dataset_korquad.ipynb

Step 3. Train the DPR Model

Change the parameters for DPR training in utils/arguments_dpr.py

  • model : the pre-trained model to load
  • train_data : path to the data generated by generate_DPR_dataset_korquad.ipynb
  • valid_data : path to the data generated by generate_DPR_dataset_korquad.ipynb
  • q_output_path : path where the query embedding model is saved
  • c_output_path : path where the context embedding model is saved

# Run from the ./level2-mrc-nlp-08 directory
$ python train_dpr.py
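
For orientation, the five parameters above could be grouped as in the sketch below. This is only an illustration: the class name `DPRArguments` and every default value are placeholders assumed here, and the real layout of utils/arguments_dpr.py may differ.

```python
from dataclasses import dataclass, fields


@dataclass
class DPRArguments:
    """Hypothetical container mirroring the parameter names listed above."""

    # Every default below is a placeholder, not a value taken from the repository.
    model: str = "<pretrained-model-name-or-path>"
    train_data: str = "<path-to-dpr-train-data>"
    valid_data: str = "<path-to-dpr-valid-data>"
    q_output_path: str = "<dir-for-query-embedding-model>"
    c_output_path: str = "<dir-for-context-embedding-model>"


# Override only what differs from the defaults (example paths are made up).
args = DPRArguments(
    q_output_path="./outputs/q_encoder",
    c_output_path="./outputs/c_encoder",
)
param_names = [f.name for f in fields(args)]
```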

Step 4. Preprocessing for Retrieval

database/get_embedding_vec.py : saves the BM25 model and the DPR embedding vectors

  • model : path to the trained context embedding model
  • wiki_path : path to the Wiki.doc data
  • valid_data : path to the validation split of the query-passage pair data
  • save_path : path where the embedding vectors are saved

test_retrieval.py

  • model : path to the trained query embedding model
  • valid_data : path to the validation split of the query-passage pair data
  • faiss_path : the save_path used when running database/get_embedding_vec.py
  • bm25_path : the save_path used when running database/get_embedding_vec.py
  • context_path : the save_path used when running database/get_embedding_vec.py

test_retrieval_inference.py

  • model : path to the trained query embedding model
  • test_dataset : path to the test split of the query-passage pair data
  • faiss_path : same as above
  • bm25_path : same as above
  • context_path : same as above

# Change the working directory
$ cd database

# Run the following script to extract the embedding vectors
$ python get_embedding_vec.py

# Check BM25 and DPR performance
$ cd ..
$ python test_retrieval.py

# Generate the retrieved data to be used at inference time
$ python test_retrieval_inference.py
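
For readers unfamiliar with the sparse side of this step, the following is a minimal pure-Python sketch of textbook Okapi BM25 scoring. It is not the repository's implementation: whitespace tokenization, the defaults `k1=1.5` and `b=0.75`, and the function names are all assumptions made for illustration, and a real Korean pipeline would most likely use a proper tokenizer.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.split() for doc in docs]  # whitespace tokens (assumption)
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.split():
            if term not in term_freq:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates with k1 and is normalized by document length via b.
            denom = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def top_k(query, docs, k=2):
    """Return the indices of the k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]
```

The top-k indices returned here play the same role as the retrieved passages that the reader later consumes at inference time.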

Step 5. Train the Reader

Change the parameters for training the extraction-based model in utils/arguments_extraction_reader.py

  • model_name_or_path : the pre-trained model to load
  • dataset_name : change to the path of the query-passage pair data or the augmented data
  • output_dir : path where the trained model and evaluation results are saved

# Run the following script to train the extraction-based model
$ python train_extraction_reader.py

# Not used in the project, but these train a generation-based model; parameters are changed in the same way as above
$ python train_generation_reader_Seq2SeqLM.py
$ python train_generation_reader_CausalLM.py

Step 6. Run Inference

Change the parameters of the extraction-based model used for inference in utils/arguments_inference.py

  • model_name_or_path : the fully trained model to load
  • output_dir : path where the inference results are saved

# On line 50 of the code, change the path from which the retrieved data is loaded to the one you want
$ python inference.py

Step 7. Run the Ensemble

# Run a correlation analysis on the predictions.json values generated when running train_extraction_reader
$ correlation_exp.ipynb

# Once the models to use have been selected through the correlation analysis, ensemble the nbest_predictions.json files those models produced at inference time / either version can be used
$ ensemble_v1.ipynb
$ ensemble_v2.ipynb

3. Project Workflow

Task              Task Description
EDA               Visualization and analysis to understand the data: duplicate checks, token count distribution, data quality checks, etc.
Baseline model    Experiments with and selection of pre-trained models suitable for use as the Reader Model
Retrieval         Implementation of and experiments with the BM25 and DPR retrieval methods
Reader Model      Transfer Learning
                  CNN Head
                  Cleaning
Post-Processing   Inference post-processing
                  Model diversity check
                  Ensemble

Post-Processing

Inference Post-Processing

  • When the combined model derives its best answer, the same word appearing at different positions in a document receives different start and end logits, so its probability is computed separately for each position; inference post-processing was introduced to address this
  • When the answer text is identical, the probabilities are summed, and the answer is selected based on the total probability
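
The summation described above can be sketched as follows. The input layout, a list of candidates with "text" and "probability" keys, is an assumption made for illustration (it resembles a typical n-best prediction file, but the repository's exact format is not shown here), and `merge_duplicate_answers` is a hypothetical name.

```python
from collections import defaultdict


def merge_duplicate_answers(nbest):
    """Sum the probabilities of candidates that share the same answer text.

    `nbest` is assumed to be a list of dicts with "text" and "probability" keys.
    Returns the best answer text and the per-text probability totals.
    """
    totals = defaultdict(float)
    for candidate in nbest:
        totals[candidate["text"]] += candidate["probability"]
    best_text = max(totals, key=totals.get)
    return best_text, dict(totals)


# "Seoul" appears at two positions; neither span alone beats "Busan".
nbest = [
    {"text": "Busan", "probability": 0.40},
    {"text": "Seoul", "probability": 0.35},
    {"text": "Seoul", "probability": 0.30},
]
best, totals = merge_duplicate_answers(nbest)
```

In this made-up example the single highest-probability span is "Busan", but after merging the two "Seoul" spans their combined probability is larger, so "Seoul" is selected.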

Model Diversity Check

  • Each model's predictions are converted into a vector (1 for an incorrect answer, 0 for a correct one) and the correlation between models is analyzed, so that the selected models complement one another
  • For example, if model1 and model2 each made five predictions, the correlation between the two vectors [1, 0, 0, 1, 1] and [0, 1, 1, 0, 1] is computed
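
As a worked version of this example, the sketch below computes the Pearson correlation of the two error vectors. The text does not say which correlation coefficient the notebook uses, so Pearson is an assumption here, and the helper name `pearson` is hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Error vectors from the example above: 1 = incorrect, 0 = correct.
model1 = [1, 0, 0, 1, 1]
model2 = [0, 1, 1, 0, 1]
r = pearson(model1, model2)  # -0.8 / 1.2, roughly -0.667
```

A negative value like this means the two models tend to fail on different questions, which is the kind of pair that can plausibly complement each other in an ensemble.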

Ensemble

  • Ensembling was introduced to combine several differently trained models so that they compensate for one another and produce better results

  • Soft voting by summed probability (the same approach as the post-processing described above)

    • Load the answer-probability pairs of the models to be ensembled and sum the probabilities for the same answer text
    • Select the answer with the highest probability as the final answer
  • Majority voting

    • Load the answer-probability pairs of the models to be ensembled and select the most frequent answer as the final answer
    • If answers are tied, select the answer with the highest probability, regardless of the majority-vote result
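
Both voting schemes can be sketched in a few lines. This is an illustration under stated assumptions, not the code in ensemble_v1.ipynb or ensemble_v2.ipynb: each model is assumed to contribute a list of (text, probability) pairs for one question, the function names are hypothetical, and the tie-break is read as "pick the most probable top answer among all models".

```python
from collections import Counter, defaultdict


def soft_vote(model_nbests):
    """Sum probabilities per answer text across all models and return the best text."""
    totals = defaultdict(float)
    for nbest in model_nbests:
        for text, prob in nbest:
            totals[text] += prob
    return max(totals, key=totals.get)


def majority_vote(model_nbests):
    """Return the most frequent top answer; on a tie, fall back to the highest probability."""
    # Each model votes with its own highest-probability candidate.
    top_answers = [max(nbest, key=lambda pair: pair[1]) for nbest in model_nbests]
    counts = Counter(text for text, _ in top_answers)
    best_count = max(counts.values())
    tied = [text for text, count in counts.items() if count == best_count]
    if len(tied) == 1:
        return tied[0]
    # Tie: ignore the vote counts and take the single most probable top answer.
    return max(top_answers, key=lambda pair: pair[1])[0]


# Made-up predictions from three models for one question.
three_models = [
    [("Seoul", 0.6), ("Busan", 0.3)],
    [("Busan", 0.5), ("Seoul", 0.4)],
    [("Seoul", 0.7), ("Incheon", 0.2)],
]
# Two models that disagree, to exercise the tie-break.
two_models = [[("Seoul", 0.6)], [("Busan", 0.9)]]
```

Soft voting uses every candidate's probability, so a consistently second-ranked answer can still win; majority voting only looks at each model's top answer and uses probability solely to break ties.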

4. Leaderboard Results

Public leaderboard ranking

Private leaderboard ranking

About

level2-mrc-nlp-08 created by GitHub Classroom
