Description
I'm attempting to evaluate an OpenLlama model on a test dataset. Single-element inference is considerably slow, so I'm trying to use batching for efficiency. However, during batch inference I'm encountering a CUDA error.
Error Message
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [277,0,0], thread: [125,0,0] Assertion 'srcIndex < srcSelectDimSize' failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [277,0,0], thread: [126,0,0] Assertion 'srcIndex < srcSelectDimSize' failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [277,0,0], thread: [127,0,0] Assertion 'srcIndex < srcSelectDimSize' failed.
...
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with 'TORCH_USE_CUDA_DSA' to enable device-side assertions.
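As the message notes, the reported stack trace may point at the wrong call because CUDA errors are raised asynchronously. To get a synchronous trace, my understanding is that the environment variable just needs to be set before any CUDA work happens (standard PyTorch behaviour, not specific to this repo); a minimal sketch:

import os

# Must be set before the first CUDA call so kernel launches run synchronously
# and the Python stack trace points at the op that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

Equivalently, the script can be launched with CUDA_LAUNCH_BLOCKING=1 python my_eval_script.py (my_eval_script.py being a placeholder for my evaluation script).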
Code for Batch Inference
import torch
from tqdm import tqdm

def make_batch_inference(dataset, batch_size=8):
    all_out = []
    progress_bar = tqdm(range(0, len(dataset), batch_size), desc="Inferencing")
    for start_idx in progress_bar:
        end_idx = start_idx + batch_size
        batch_questions = dataset['question'][start_idx:end_idx]
        batch = tokenizer(batch_questions, return_tensors='pt',
                          padding=True, truncation=True, max_length=512)
        with torch.cuda.amp.autocast():
            output_tokens = model.generate(
                input_ids=batch["input_ids"].to("cuda:0"), max_new_tokens=2048
            )
        batch_out = [extract_first_sparql(tokenizer.decode(tokens, skip_special_tokens=True))
                     for tokens in output_tokens]
        all_out.extend(batch_out)
    return all_out
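For reference, here is a minimal sketch of the variant I understand is usually recommended for batched generation with decoder-only models: left padding plus forwarding the attention mask to generate(). The settings are assumptions on my part, and I have not verified that they avoid the assert:

# Sketch only: same tokenizer/model as above, but with left padding and the
# attention mask passed through to generate().
tokenizer.padding_side = "left"

batch = tokenizer(batch_questions, return_tensors='pt',
                  padding=True, truncation=True, max_length=512)
batch = {k: v.to("cuda:0") for k, v in batch.items()}

with torch.cuda.amp.autocast():
    output_tokens = model.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        max_new_tokens=2048,
    )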
Loading the Dataset
from datasets import Dataset

test_data = load_data_from_file("data/kqapro_lcquad_test.json")
test_dataset = Dataset.from_dict(test_data)
results = make_batch_inference(test_dataset)
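For context, the test file is essentially a dict of parallel lists (only the 'question' column is used above); roughly this shape, with illustrative values rather than the real data:

# Illustrative shape of test_data; the real file has more columns and rows.
test_data = {
    "question": [
        "Which river flows through Berlin?",
        "Who wrote Don Quixote?",
    ],
}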
Additional Information
The base model is "openlm-research/open_llama_7b_v2", which I fine-tuned with PEFT. I load the fine-tuned model like this:
import os
import torch
from peft import AutoPeftModelForCausalLM
from transformers import LlamaTokenizer

device_map = {"": 0}
model = AutoPeftModelForCausalLM.from_pretrained(
    os.path.join(output_dir, 'saved_model'),
    device_map=device_map,
    torch_dtype=torch.bfloat16,
)
tokenizer = LlamaTokenizer.from_pretrained(model_id)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
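One thing I'm unsure about: add_special_tokens introduces a brand-new [PAD] id, and I never resize the model's embeddings to match, which as far as I understand can cause exactly this kind of out-of-range index assert during embedding lookup. Two options I'm aware of, sketched below; I have not verified that either works cleanly through the PEFT wrapper:

# Option 1: reuse an existing special token as the pad token, so no new ids
# are introduced and the embedding matrix keeps its original size.
tokenizer.pad_token = tokenizer.eos_token

# Option 2: keep the dedicated [PAD] token, but grow the embedding matrix so
# the new id is within range for the model.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id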
Any assistance on this issue would be greatly appreciated. Thank you in advance!