Benchmarks for llama_cpp and other backends here #6373

BBC-Esq · 2024-03-28T18:02:36Z

BBC-Esq
Mar 28, 2024

Here's my initial testing. Feel free to contact me if you want the actual test scripts as I'm hesitant to past the entirety here! EDITED to include numbers from running 15 tests of all models now:

Testing Procedure:

Windows 10, RTX 4090, all libraries used are the most recent as of 3/28/2024 except torch==2.2.0 and nvidia-ml-py==12.535.133

Here are the relevant portions of the scripts used to test, with private information omitted and redundant code omitted (but noted) where appropriate:

BitsAndBytes

import torch
import gc
from transformers import AutoTokenizer, AutoModelForCausalLM, MistralForCausalLM, FalconForCausalLM, LlamaForCausalLM, PhiForCausalLM, StableLmForCausalLM, TextStreamer, BitsAndBytesConfig

# 4-bit settings
bnb_bfloat16_settings = {
    'tokenizer_settings': {
        'torch_dtype': torch.bfloat16,
    },
    'model_settings': {
        'torch_dtype': torch.bfloat16,
        'quantization_config': BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        ),
        'resume_download': True,
        'low_cpu_mem_usage': True,
    }
}

bnb_float16_settings = {
    'tokenizer_settings': {
        'torch_dtype': torch.float16,
    },
    'model_settings': {
        'torch_dtype': torch.float16,
        'quantization_config': BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
        ),
        'resume_download': True,
        'low_cpu_mem_usage': True,
    }
}


#8-bit settings
bnb_bfloat16_settings = {
    'tokenizer_settings': {
        'torch_dtype': torch.bfloat16,
    },
    'model_settings': {
        'torch_dtype': torch.bfloat16,
        'quantization_config': BitsAndBytesConfig(
            load_in_8bit=True,
            llm_int8_threshold=6.0,
        ),
        'resume_download': True,
        'low_cpu_mem_usage': True,
    }
}

bnb_float16_settings = {
    'tokenizer_settings': {
        'torch_dtype': torch.float16,
    },
    'model_settings': {
        'torch_dtype': torch.float16,
        'quantization_config': BitsAndBytesConfig(
            load_in_8bit=True,
            llm_int8_threshold=6.0,
        ),
        'resume_download': True,
        'low_cpu_mem_usage': True,
    }
}


# See configuration_utils.py within the transformers library for more settings info
common_generate_settings = {
    'max_length': 4095,
    #'max_new_tokens': 500,
    'do_sample': False,
}

class CleanupMixin:
    def cleanup(self):
        if hasattr(self, 'model'):
            del self.model
        if hasattr(self, 'tokenizer'):
            del self.tokenizer
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

Then there is a separate class for each model tested, and here is one example:

class Llama_2_7b_chat_hf(CleanupMixin):  # float16 l;lama
    def __init__(self):
        model_name = f"PERSONAL PATH REDACTED"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, **bnb_float16_settings['tokenizer_settings'])
        self.model = AutoModelForCausalLM.from_pretrained(model_name, **bnb_float16_settings['model_settings'])

    def generate_response(self, user_message):
        system_message = "You are a helpful assistant who answers questions in a succinct fashion based on the contexts given to you."
        prompt = f"<<SYS>>\n{system_message}\n<</SYS>>\n\n[INST] {user_message} [/INST]"
        inputs = self.tokenizer(prompt, return_tensors="pt", return_attention_mask=True).to("cuda")
        generated_text = self.model.generate(**inputs, **common_generate_settings)
        full_response = self.tokenizer.decode(generated_text[0], skip_special_tokens=True)
        # Extracting only the model's response
        response_start_idx = full_response.find("[/INST]") + len("[/INST]")
        return full_response[response_start_idx:].strip()

The test scripts for Ctranslate2 and llama_cpp are all in one script, but testing bitsandbytes testing took 2 scripts. Here is the script that calls the script above. Obviously, you'd comment/uncomment the models you want to test. NOTE: This test is geared towards a RAG application - hence the long "user message" - because that is what my personal repository is all about:

import time
import pynvml
import warnings
from chat_templates import (
    #Dolphin_2_6_Mistral_7B,
    #Falcon_7b_instruct,
    #Gemma_2b_it,
    #Gemma_7b_it,
    #Hermes_2_Pro_Mistral_7b,
    Llama_2_13b_chat_hf,
    Llama_2_7b_chat_hf,
    Marx_3B_V3,
    #Microsoft_Phi_2,
    Mistral_7B_Instruct_v0_2,
    Neural_Chat_7b_v3_3,
    #Rocket_3B,
    SOLAR_10_7B_Instruct_v1_0,
    StableLM_Zephyr_3B,
)


pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

warnings.filterwarnings("ignore", message="Torch was not compiled with flash attention.")
warnings.filterwarnings("ignore", message="No module named 'triton'")
warnings.filterwarnings("ignore", module="xformers.*")
warnings.filterwarnings("ignore", module=".*bitsandbytes.*")
warnings.filterwarnings("ignore", module=".*transformers.*")
warnings.filterwarnings("ignore", message=".*Torch was not compiled with flash attention.*", module=".*transformers.models.llama.modeling_llama.*")
warnings.filterwarnings("ignore", message=".*Torch was not compiled with flash attention.*", module=".*transformers.models.mistral.modeling_mistral.*")

model_names = [
    #"Dolphin_2_6_Mistral_7B",
    #"Dolphin_2_6_Phi_2",
    #"Falcon_7b_instruct",
    #"Gemma_2b_it",
    #"Gemma_7b_it",
    #"Hermes_2_Pro_Mistral_7b",
    "Llama_2_13b_chat_hf",
    "Llama_2_7b_chat_hf",
    #"Marx_3B_V3",
    #"Microsoft_Phi_2",
    "Mistral_7B_Instruct_v0_2",
    "Neural_Chat_7b_v3_3",
    #"Phi_2_DPO",
    #"Rocket_3B",
    #"SOLAR_10_7B_Instruct_v1_0",
    #"StableLM_Zephyr_3B",
    #"phi_2_orange_v2",
]

def select_model_class(model_name):
    model_classes = {
        #"Dolphin_2_6_Mistral_7B": Dolphin_2_6_Mistral_7B,
        #"Dolphin_2_6_Phi_2": Dolphin_2_6_Phi_2,
        #"Falcon_7b_instruct": Falcon_7b_instruct,
        #"Gemma_2b_it": Gemma_2b_it,
        #"Gemma_7b_it": Gemma_7b_it,
        #"Hermes_2_Pro_Mistral_7b": Hermes_2_Pro_Mistral_7b,
        "Llama_2_13b_chat_hf": Llama_2_13b_chat_hf,
        "Llama_2_7b_chat_hf": Llama_2_7b_chat_hf,
        #"Marx_3B_V3": Marx_3B_V3,
        #"Microsoft_Phi_2": Microsoft_Phi_2,
        "Mistral_7B_Instruct_v0_2": Mistral_7B_Instruct_v0_2,
        "Neural_Chat_7b_v3_3": Neural_Chat_7b_v3_3,
        #"Phi_2_DPO": Phi_2_DPO,
        #"Rocket_3B": Rocket_3B,
        #"SOLAR_10_7B_Instruct_v1_0": SOLAR_10_7B_Instruct_v1_0,
        #"StableLM_Zephyr_3B": StableLM_Zephyr_3B,
        #"phi_2_orange_v2": phi_2_orange_v2,
    }

    return model_classes[model_name]()

cumulative_stats = {model_name: {"total_time": 0, "total_vram": 0, "count": 0} for model_name in model_names}

def generate_and_save_responses(user_message):
    responses = []
    summary_info = []
    for model_name in model_names:
        print(f"\033[92mProcessing with {model_name}...\033[0m")
        start_time = time.time()
        model = select_model_class(model_name)
        response = model.generate_response(user_message)

        memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        vram_usage_before_cleanup = memory_info.used / 1024**2

        model.cleanup()

        end_time = time.time()
        execution_time = end_time - start_time

        print(f"\033[92m{model_name} processing completed in {execution_time:.2f} seconds.\033[0m")

        response_info = {
            "model_name": model_name,
            "execution_time": execution_time,
            "vram_usage_before_cleanup": vram_usage_before_cleanup,
            "response_text": f"Response from {model_name} (Execution Time: {execution_time:.2f} seconds, VRAM Before Cleanup: {vram_usage_before_cleanup:.2f} MB):\n\n{response}\n\n---\n\n"
        }
        responses.append(response_info)

        summary_info.append((model_name, execution_time, vram_usage_before_cleanup))

        cumulative_stats[model_name]["total_time"] += execution_time
        cumulative_stats[model_name]["total_vram"] += vram_usage_before_cleanup
        cumulative_stats[model_name]["count"] += 1

        time.sleep(5)

    sorted_responses = sorted(responses, key=lambda x: x["execution_time"])
    sorted_summary_info = sorted(summary_info, key=lambda x: x[1])

    sorted_response_texts = [info["response_text"] for info in sorted_responses]

    for response_info in sorted_responses:
        filename = f"results_{response_info['model_name']}.txt"
        with open(filename, "a", encoding="utf-8") as file:
            file.write(response_info["response_text"])

    header = "| Model Name                     | Execution Time (s) | VRAM Before Cleanup (MB) |"
    divider = "+" + "-" * (len(header) - 2) + "+"
    summary_text = [divider, header, divider]
    row_format = "| {:30} | {:18.2f} | {:24.2f} |"
    for info in sorted_summary_info:
        summary_text.append(row_format.format(*info))
    summary_text.append(divider)
    summary_text = "\n".join(summary_text)

    with open("results_tables.txt", "a", encoding="utf-8") as file:
        file.write(summary_text + "\n\n---\n\n")

def print_average_stats():
    for model_name, stats in cumulative_stats.items():
        if stats["count"] > 0:
            avg_time = stats["total_time"] / stats["count"]
            avg_vram = stats["total_vram"] / stats["count"]
            print(f"\033[91m{model_name} - Average Execution Time: {avg_time:.2f} seconds, Average VRAM Usage Before Cleanup: {avg_vram:.2f} MB\033[0m")

user_message = """
Only base your answer to the following question on the provided context/contexts accompanying this question. If you cannot answer based on the included context/contexts alone, please state so.

My question is: What is the deadline to hold a preliminary protective hearing in a dependency case?

And here are the relevant contexts to base your answer off of:

----------
Context 1 | From File: Georgia Juvenile Law Practice and Procedure - August 2022.pdf ----------

§ 6:21. Time limits—Preliminary protective hearing, Ga. Juv. Prac. & Proc. § 6:21
© 2022 Thomson Reuters. No claim to original U.S. Government Works.
1
Ga. Juv. Prac. & Proc. § 6:21
Georgia Juvenile Practice and Procedure with Forms | August 2022 Update
Mark H. Murphy
Chapter 6. Dependency Proceedings
§ 6:21. Time limits—Preliminary protective hearing
If a child alleged to be dependent is removed from her home and is not returned home, the preliminary protective hearing must be held promptly and not later than 72 hours after the child is placed in foster care.1 If the 72-hour time period expires on a weekend or legal holiday, then the court is required to hold the hearing on the next day which is not a weekend or legal holiday.2

----------
Context 2 | From File: Georgia Juvenile Law Practice and Procedure - August 2022.pdf ----------

§ 6:35. Preliminary protective hearing—Dependency petition..., Ga. Juv. Prac. & Proc....
© 2022 Thomson Reuters. No claim to original U.S. Government Works.
1
Ga. Juv. Prac. & Proc. § 6:35
Georgia Juvenile Practice and Procedure with Forms | August 2022 Update
Mark H. Murphy
Chapter 6. Dependency Proceedings
§ 6:35. Preliminary protective hearing—Dependency petition if child not returned home
Under O.C.G.A. § 15-11-145(g), if the child is not released at the preliminary protective hearing, a petition for dependency shall be made and presented to the court within five days of such hearing. Westlaw. © 2022 Thomson Reuters. No Claim to Orig. U.S. Govt. Works. End of Document © 2022 Thomson Reuters. No claim to original U.S. Government Works.

----------
Context 3 | From File: Georgia Juvenile Law Practice and Procedure - August 2022.pdf ----------

§ 6:29. Preliminary protective hearing, Ga. Juv. Prac. & Proc. § 6:29
© 2022 Thomson Reuters. No claim to original U.S. Government Works.
1
Ga. Juv. Prac. & Proc. § 6:29
Georgia Juvenile Practice and Procedure with Forms | August 2022 Update
Mark H. Murphy
Chapter 6. Dependency Proceedings
§ 6:29. Preliminary protective hearing
The preliminary protective hearing is essentially a probable cause hearing designed to provide prompt judicial oversight of state intervention into the constitutionally protected parent-child relationship. It must be held promptly after a child is removed from the home and is intended to provide due process to the parties involved. The hearing must occur no later than 72 hours after
"""

for _ in range(15):
    generate_and_save_responses(user_message)

print_average_stats()

Ctranslate2

import os
import sys
import ctranslate2
import sentencepiece as spm
import time
import pynvml
import warnings
import gc

user_prompt = """
Only base your answer to the following question on the provided context/contexts accompanying this question. If you cannot answer based on the included context/contexts alone, please state so.

My question is: What is the deadline to hold a preliminary protective hearing in a dependency case?

And here are the relevant contexts to base your answer off of:

----------
Context 1 | From File: Georgia Juvenile Law Practice and Procedure - August 2022.pdf ----------

§ 6:21. Time limits—Preliminary protective hearing, Ga. Juv. Prac. & Proc. § 6:21
© 2022 Thomson Reuters. No claim to original U.S. Government Works.
1
Ga. Juv. Prac. & Proc. § 6:21
Georgia Juvenile Practice and Procedure with Forms | August 2022 Update
Mark H. Murphy
Chapter 6. Dependency Proceedings
§ 6:21. Time limits—Preliminary protective hearing
If a child alleged to be dependent is removed from her home and is not returned home, the preliminary protective hearing must be held promptly and not later than 72 hours after the child is placed in foster care.1 If the 72-hour time period expires on a weekend or legal holiday, then the court is required to hold the hearing on the next day which is not a weekend or legal holiday.2

----------
Context 2 | From File: Georgia Juvenile Law Practice and Procedure - August 2022.pdf ----------

§ 6:35. Preliminary protective hearing—Dependency petition..., Ga. Juv. Prac. & Proc....
© 2022 Thomson Reuters. No claim to original U.S. Government Works.
1
Ga. Juv. Prac. & Proc. § 6:35
Georgia Juvenile Practice and Procedure with Forms | August 2022 Update
Mark H. Murphy
Chapter 6. Dependency Proceedings
§ 6:35. Preliminary protective hearing—Dependency petition if child not returned home
Under O.C.G.A. § 15-11-145(g), if the child is not released at the preliminary protective hearing, a petition for dependency shall be made and presented to the court within five days of such hearing. Westlaw. © 2022 Thomson Reuters. No Claim to Orig. U.S. Govt. Works. End of Document © 2022 Thomson Reuters. No claim to original U.S. Government Works.

----------
Context 3 | From File: Georgia Juvenile Law Practice and Procedure - August 2022.pdf ----------

§ 6:29. Preliminary protective hearing, Ga. Juv. Prac. & Proc. § 6:29
© 2022 Thomson Reuters. No claim to original U.S. Government Works.
1
Ga. Juv. Prac. & Proc. § 6:29
Georgia Juvenile Practice and Procedure with Forms | August 2022 Update
Mark H. Murphy
Chapter 6. Dependency Proceedings
§ 6:29. Preliminary protective hearing
The preliminary protective hearing is essentially a probable cause hearing designed to provide prompt judicial oversight of state intervention into the constitutionally protected parent-child relationship. It must be held promptly after a child is removed from the home and is intended to provide due process to the parties involved. The hearing must occur no later than 72 hours after
"""

def main():
    # List of model configurations
    models = [
        {
            "model_dir": r"[PRIVATE PATH REDACTED]",
            "build_prompt": build_prompt_solar_10_7b_instruct_v1_0
        },
        {
            "model_dir": r"[PRIVATE PATH REDACTED]",
            "build_prompt": build_prompt_neural_chat_7b_v3_3
        },
        {
            "model_dir": r"[PRIVATE PATH REDACTED]",
            "build_prompt": build_prompt_llama_2_7b_chat
        },
        {
            "model_dir": r"[PRIVATE PATH REDACTED]",
            "build_prompt": build_prompt_mistral_7b_instruct_v0_2
        },
        {
            "model_dir": r"D:\Scripts\chat-ctranslate2\Llama-2-13b-chat-hf-ct2-int8",
            "build_prompt": build_prompt_llama_2_13b_chat
        },
    ]

    context_length = 4095
    max_generation_length = 512
    max_prompt_length = context_length - max_generation_length

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    warnings.filterwarnings("ignore", module="pynvml")

    results = {}

    for model_config in models:
        model_dir = model_config["model_dir"]
        build_prompt_func = model_config["build_prompt"]

        model_name = os.path.basename(model_dir)
        print(f"\033[32mLoading the model: {model_name}...\033[0m")
        intra_threads = max(os.cpu_count() - 4, os.cpu_count())
        generator = ctranslate2.Generator(model_dir, device="cuda", compute_type="int8", intra_threads=intra_threads)
        sp = spm.SentencePieceProcessor(os.path.join(model_dir, "tokenizer.model"))

        model_results = []

        for _ in range(15):
            start_time = time.time()
            dialog = [{"role": "user", "content": user_prompt}]
            prompt_tokens = build_prompt_func(sp, dialog)
            step_results = generator.generate_tokens(
                prompt_tokens,
                max_batch_size=1000,
                batch_type="tokens",
                max_length=max_generation_length,
                sampling_temperature=0.1,
                sampling_topk=20,
                sampling_topp=1,
            )

            memory_info_peak = pynvml.nvmlDeviceGetMemoryInfo(handle)
            vram_usage_peak = memory_info_peak.used / 1024**2

            print("", flush=True)
            text_output = ""
            num_generated_tokens = 0
            for word in generate_words(sp, step_results):
                if text_output:
                    word = " " + word
                print(word, end="", flush=True)
                text_output += word
                num_generated_tokens += 1
            print("")

            end_time = time.time()
            response_time = end_time - start_time

            model_results.append({
                "response_time": response_time,
                "peak_vram_usage": vram_usage_peak
            })

        results[model_name] = model_results

        del generator
        del sp
        gc.collect()

        time.sleep(2)

    pynvml.nvmlShutdown()

    print("\nAverage Results:")
    for model_name, model_results in results.items():
        avg_response_time = sum(result['response_time'] for result in model_results) / len(model_results)
        avg_peak_vram_usage = sum(result['peak_vram_usage'] for result in model_results) / len(model_results)
        print(f"Model: {model_name}")
        print(f"Average Response Time: {avg_response_time:.2f} seconds")
        print(f"Average Peak VRAM Usage: {avg_peak_vram_usage:.2f} MB")
        print()

def generate_words(sp, step_results):
    tokens_buffer = []
    for step_result in step_results:
        is_new_word = step_result.token.startswith("▁")
        if is_new_word and tokens_buffer:
            word = sp.decode(tokens_buffer)
            if word:
                yield word
            tokens_buffer = []
        tokens_buffer.append(step_result.token_id)
    if tokens_buffer:
        word = sp.decode(tokens_buffer)
        if word:
            yield word

def build_prompt_solar_10_7b_instruct_v1_0(sp, dialog):
    user_prompt = dialog[0]["content"]
    system_message = "You are a helpful assistant who answers questions in a succinct fashion based on the contexts given to you."
    prompt = f"""<s>### System:\n{system_message}\n### User:\n{user_prompt}\n### Assistant:</s>"""
    dialog_tokens = sp.encode_as_pieces(prompt)
    return dialog_tokens

def build_prompt_neural_chat_7b_v3_3(sp, dialog):
    system_prompt = "You are a helpful assistant who answers questions in a succinct fashion based on the contexts given to you."
    user_prompt = dialog[0]["content"]
    prompt = f"### System:\n{system_prompt}\n### User:\n{user_prompt}\n### Assistant: "
    dialog_tokens = sp.encode_as_pieces(prompt)
    return dialog_tokens

def build_prompt_llama_2_7b_chat(sp, dialog):
    user_prompt = dialog[0]["content"]
    system_message = "You are a helpful assistant who answers questions in a succinct fashion based on the contexts given to you."
    prompt = f"<<SYS>>\n{system_message}\n<</SYS>>\n\n[INST] {user_prompt} [/INST]"
    dialog_tokens = sp.encode_as_pieces(prompt)
    return dialog_tokens

def build_prompt_llama_2_13b_chat(sp, dialog):
    user_prompt = dialog[0]["content"]
    system_message = "You are a helpful assistant who answers questions in a succinct fashion based on the contexts given to you."
    prompt = f"<<SYS>>\n{system_message}\n<</SYS>>\n\n[INST] {user_prompt} [/INST]"
    dialog_tokens = sp.encode_as_pieces(prompt)
    return dialog_tokens
    
def build_prompt_mistral_7b_instruct_v0_2(sp, dialog):
    user_prompt = dialog[0]["content"]
    prompt = f"<s>[INST] {user_prompt} [/INST]</s>\n"
    dialog_tokens = sp.encode_as_pieces(prompt)
    return dialog_tokens

if __name__ == "__main__":
    main()

Llama_cpp

This script only processed one model at a time and was a royal pain in the ass to run manually multiples times...I ran into problems making it batch process and give reliable outputs for some reason so was forced to do it this way. NOTE: I had to change the structure of the "prompt" to remove all newlines to get the model to respond properly, just FYI:

import llama_cpp
import time
import pynvml

model_path = r"D:\Scripts\chat-llama-cpp\SOLAR-10.7B-Instruct-v1.0-GGUF\solar-10.7b-instruct-v1.0.Q8_0.gguf"

model = llama_cpp.Llama(
    model_path=model_path,
    chat_format="solar",
    main_gpu=0,
    n_gpu_layers=-1,
    seed=-1,
    n_ctx=4095,
    use_mmap=True,
    use_mlock=False,
    n_batch=512,
    offload_kqv=True,
)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def generate_text(prompt, max_tokens=512, temperature=0.1, top_p=0.95, top_k=1):
    start_time = time.time()
    response = model.__call__(
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        # stop=["Q:", "\n"]
    )
    end_time = time.time()
    processing_time = end_time - start_time
    generated_text = response['choices'][0]['text']
    print(f"\033[92mProcessing time: {processing_time:.2f} seconds\033[0m")
    return generated_text

if __name__ == "__main__":
    prompt = """
Only base your answer to the following question on the provided context/contexts accompanying this question. If you cannot answer based on the included context/contexts alone, please state so. || My question is: What is the deadline to hold a preliminary protective hearing in a dependency case? || And here are the relevant contexts to base your answer off of: || Context 1 | From File: Georgia Juvenile Law Practice and Procedure - August 2022.pdf | § 6:21. Time limits—Preliminary protective hearing, Ga. Juv. Prac. & Proc. § 6:21 © 2022 Thomson Reuters. No claim to original U.S. Government Works. 1 Ga. Juv. Prac. & Proc. § 6:21 Georgia Juvenile Practice and Procedure with Forms | August 2022 Update Mark H. Murphy Chapter 6. Dependency Proceedings § 6:21. Time limits—Preliminary protective hearing If a child alleged to be dependent is removed from her home and is not returned home, the preliminary protective hearing must be held promptly and not later than 72 hours after the child is placed in foster care.1 If the 72-hour time period expires on a weekend or legal holiday, then the court is required to hold the hearing on the next day which is not a weekend or legal holiday.2 || Context 2 | From File: Georgia Juvenile Law Practice and Procedure - August 2022.pdf | § 6:35. Preliminary protective hearing—Dependency petition..., Ga. Juv. Prac. & Proc.... © 2022 Thomson Reuters. No claim to original U.S. Government Works. 1 Ga. Juv. Prac. & Proc. § 6:35 Georgia Juvenile Practice and Procedure with Forms | August 2022 Update Mark H. Murphy Chapter 6. Dependency Proceedings § 6:35. Preliminary protective hearing—Dependency petition if child not returned home Under O.C.G.A. § 15-11-145(g), if the child is not released at the preliminary protective hearing, a petition for dependency shall be made and presented to the court within five days of such hearing. Westlaw. © 2022 Thomson Reuters. No Claim to Orig. U.S. Govt. Works. End of Document © 2022 Thomson Reuters. No claim to original U.S. Government Works. || Context 3 | From File: Georgia Juvenile Law Practice and Procedure - August 2022.pdf | § 6:29. Preliminary protective hearing, Ga. Juv. Prac. & Proc. § 6:29 © 2022 Thomson Reuters. No claim to original U.S. Government Works. 1 Ga. Juv. Prac. & Proc. § 6:29 Georgia Juvenile Practice and Procedure with Forms | August 2022 Update Mark H. Murphy Chapter 6. Dependency Proceedings § 6:29. Preliminary protective hearing The preliminary protective hearing is essentially a probable cause hearing designed to provide prompt judicial oversight of state intervention into the constitutionally protected parent-child relationship. It must be held promptly after a child is removed from the home and is intended to provide due process to the parties involved. The hearing must occur no later than 72 hours after.
"""
    response = generate_text(prompt)
    print(response)

memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
max_vram_used = memory_info.used / 1024**2
print(f"\033[92mMax VRAM used: {max_vram_used:.2f} MB\033[0m")

In conclusion, this is a hobby of mine and I'm not a programmer by trade. However, I've tried to control for as many constants as possible despite varying APIs between libraries - e.g. Ctranslate2 uses "do_sample=false" while I couldn't find anything identical in llama_cpp...

Feedback is always welcome. Constructive criticism is welcome as my goal is to have accurate testing not feed my ego about who's the best. Thanks!

phymbert · 2024-03-28T18:17:47Z

phymbert
Mar 28, 2024
Collaborator

Thanks for sharing, we have continous performance in place:

server: continuous performance monitoring and PR comment #6283

Feel free to contribute. Could you please share your scripts and dataset ? Have you also tested vLLM ?

0 replies

BBC-Esq · 2024-03-28T19:19:54Z

BBC-Esq
Mar 28, 2024
Author

Thanks for the feedback. I'm in the process of testing all permutations 15 times instead of 3, which leads to more reliable numbers, but the general trends are there. Very impressive. I had thought that ctranslate2 was the fastest. Unfortunately, I'm not familiar with vllm and don't have the time to educate myself (as a non-programmer by trade) on a new backend. It took me awhile to get llama_cpp going, for example...If you or anyone have any starter scripts that'd help.

I'm new to software testing and am using my own personal benchmarks since every repository has their different ones, but controlling for the same settings best I can...Will share my scripts once I update the graph with 15 tests each.

Was looking forward to testing Vulkan (and other llama_cpp backends) in addition, but unfortunately seems like it's borked with a certain commit with plans to fix it in the near future.

0 replies

BBC-Esq · 2024-03-28T20:41:41Z

BBC-Esq
Mar 28, 2024
Author

Updated, see my first post.

0 replies

hayleyhu · 2024-11-04T23:23:11Z

hayleyhu
Nov 4, 2024

is python client having the same performance as a cpp client for llama.cpp?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Benchmarks for llama_cpp and other backends here #6373

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 4 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Benchmarks for llama_cpp and other backends here #6373

Uh oh!

Uh oh!

BBC-Esq Mar 28, 2024

Replies: 4 comments

Uh oh!

phymbert Mar 28, 2024 Collaborator

Uh oh!

BBC-Esq Mar 28, 2024 Author

Uh oh!

BBC-Esq Mar 28, 2024 Author

Uh oh!

hayleyhu Nov 4, 2024

BBC-Esq
Mar 28, 2024

phymbert
Mar 28, 2024
Collaborator

BBC-Esq
Mar 28, 2024
Author

BBC-Esq
Mar 28, 2024
Author

hayleyhu
Nov 4, 2024