Intel/Llama-2-70b-chat-hf-onnx-int4

Llama-2-70b-chat-hf-onnx-int4

Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This repository provides the INT4 weight-only quantized version of the 70B fine-tuned (chat) model in ONNX format.

Note: Use of this model is governed by the Meta license. Please ensure you have accepted that license and obtained access to the FP32 model before downloading the models here.

This INT4 model is generated with Intel® Neural Compressor's weight-only quantization method.

Model Detail	Description
Model Authors - Company	Intel
Date	August 29, 2023
Version	1
Type	Text Generation
Paper or Other Resources	-
License	https://ai.meta.com/resources/models-and-libraries/llama-downloads/
Questions or Comments	Community Tab

Intended Use	Description
Primary intended uses	You can use the raw model for text generation inference
Primary intended users	Anyone doing text generation inference
Out-of-scope uses	This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people.

Export to ONNX Model

The FP32 ONNX model is exported from meta-llama/Llama-2-70b-chat-hf with optimum-cli:

optimum-cli export onnx --model meta-llama/Llama-2-70b-chat-hf --task text-generation ./llama2_70b_chat

Build ONNX Runtime

Build ONNX Runtime from source to support the MatMulWithQuantWeight op. Refer to build-onnx-runtime-for-inferencing for the prerequisites.

git clone -b sub_byte_quant_zp https://github.com/microsoft/onnxruntime.git
cd onnxruntime
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync --skip_tests --build_wheel

Run Quantization

The weight-only quantization configuration is as follows:

dtype	group_size	scheme	algorithm
INT4	32	asym	RTN

Run INT4 weight-only quantization with Intel® Neural Compressor. The key code is shown below. For the complete quantization script, please refer to the llama weight-only example.

from neural_compressor import quantization, PostTrainingQuantConfig

# 4-bit asymmetric RTN weight quantization with group size 32 (see the table above)
config = PostTrainingQuantConfig(
    approach="weight_only",
    calibration_sampling_size=[8],
    op_type_dict={
        ".*": {
            "weight": {
                "bits": 4,
                "algorithm": ["RTN"],
                "scheme": ["asym"],
                "group_size": 32,
            }
        }
    },
)

# `dataloader` is the calibration dataloader built in the complete quantization script
q_model = quantization.fit(
    "/path/to/llama2_70b_chat/decoder_model.onnx",  # FP32 model path
    config,
    calib_dataloader=dataloader,
)
q_model.save("/path/to/Llama-2-70b-chat-hf-onnx-int4/decoder_model.onnx")  # INT4 model path
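
To make the configuration above more concrete, here is a minimal pure-Python sketch of what asymmetric round-to-nearest (RTN) quantization with a group size of 32 does to a flat list of weights. It is an illustration only, not the Intel® Neural Compressor implementation: the function names are hypothetical, and extending each group's range to include 0.0 (so the zero point fits in 4 bits) is an assumption of this sketch.

```python
import math

def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    qmax = 2 ** bits - 1                     # 15 for 4-bit
    # Assumption of this sketch: extend the range to include 0.0 so the
    # zero point always lands in [0, qmax].
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / qmax or 1.0    # guard against an all-zero group
    zero_point = round(-w_min / scale)
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    """Map 4-bit integers back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]

def quantize_rtn(weights, group_size=32, bits=4):
    """Quantize a flat weight list group by group; each group gets its own scale."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]

# Hypothetical weights: 64 values, so two groups of 32.
weights = [0.1 * math.sin(i) for i in range(64)]
groups = quantize_rtn(weights)
restored = [w for q, s, z in groups for w in dequantize_group(q, s, z)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(groups), max_err)
```

Because every group of 32 weights carries its own scale and zero point, the rounding error of each weight stays within about half of that group's scale. A smaller group size is therefore expected to track the FP32 weights more closely, at the cost of storing more scales and zero points.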

Evaluation

Operator Statistics

The operator statistics of the INT4 ONNX model are shown below:

Op Type	Total	INT4 weight	FP32
MatMul	641	561	80
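
A table like the one above could be reproduced by tallying node types in the exported graph. The sketch below assumes that quantized MatMul nodes appear under the op type MatMulWithQuantWeight (the op named in the build section) and that the remaining FP32 ones stay as MatMul; the helper name and the stand-in op list are hypothetical.

```python
from collections import Counter

def matmul_stats(op_types, quant_op="MatMulWithQuantWeight", float_op="MatMul"):
    """Count quantized vs. FP32 MatMul nodes from a list of op type names."""
    counts = Counter(op_types)
    int4 = counts[quant_op]
    fp32 = counts[float_op]
    total = int4 + fp32
    share = int4 / total if total else 0.0
    return {"total": total, "int4": int4, "fp32": fp32, "int4_share": share}

# With the onnx package installed, the op types could be collected like this:
#   import onnx
#   model = onnx.load("/path/to/decoder_model.onnx", load_external_data=False)
#   op_types = [node.op_type for node in model.graph.node]
# Stand-in list that mirrors the counts in the table above:
op_types = ["MatMulWithQuantWeight"] * 561 + ["MatMul"] * 80 + ["Add"] * 3
stats = matmul_stats(op_types)
print(stats["total"], stats["int4"], stats["fp32"], f"{stats['int4_share']:.1%}")
```

With the counts from the table, roughly 87.5% of the MatMul nodes carry INT4 weights.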

Evaluation of Accuracy and Perplexity

Evaluate the model with the evaluation API of Intel® Extension for Transformers on the lambada_openai task.

from intel_extension_for_transformers.evaluation.lm_eval import evaluate

model_path = "/path/to/Llama-2-70b-chat-hf-onnx-int4"  # local directory with the INT4 ONNX model
tokenizer = "Intel/Llama-2-70b-chat-hf-onnx-int4"
batch_size = 64
tasks = ["lambada_openai"]

results = evaluate(
    model="hf-causal",
    model_args="pretrained=" + model_path + ",tokenizer=" + tokenizer,
    batch_size=batch_size,
    tasks=tasks,
    model_format="onnx",
)
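
For readers unfamiliar with the two metrics reported for this task, the sketch below shows how they are commonly aggregated, assuming the evaluation follows the usual LAMBADA convention of lm-evaluation-harness: accuracy is the fraction of samples whose target word is reproduced by greedy decoding, and perplexity is the exponential of the negative mean log-likelihood of the target. The function name and the sample values are hypothetical.

```python
import math

def lambada_metrics(samples):
    """samples: list of (target_loglikelihood, greedy_match) pairs, one per example."""
    n = len(samples)
    acc = sum(1 for _, matched in samples if matched) / n
    ppl = math.exp(-sum(ll for ll, _ in samples) / n)
    return acc, ppl

# Hypothetical per-example results, not taken from the model.
samples = [(-0.5, True), (-1.5, True), (-2.0, False), (0.0, True)]
acc, ppl = lambada_metrics(samples)
print(acc, ppl)
```

Higher accuracy and lower perplexity both indicate that the model assigns more probability to the correct final word.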

Model	Model Size (GB)	lambada_openai acc	lambada_openai ppl
FP32	257	0.7543	2.6181
INT4	43	0.7510	2.6561
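
The trade-off in the table can be summarized with a little arithmetic. The snippet below only restates the numbers above as ratios; the variable names are chosen for illustration.

```python
# Values copied from the results table above.
fp32 = {"size_gb": 257, "acc": 0.7543, "ppl": 2.6181}
int4 = {"size_gb": 43, "acc": 0.7510, "ppl": 2.6561}

compression = fp32["size_gb"] / int4["size_gb"]
acc_drop = fp32["acc"] - int4["acc"]
ppl_increase = int4["ppl"] - fp32["ppl"]

print(f"size reduction: {compression:.2f}x")
print(f"accuracy drop:  {acc_drop:.4f} ({acc_drop / fp32['acc']:.2%} relative)")
print(f"ppl increase:   {ppl_increase:.4f} ({ppl_increase / fp32['ppl']:.2%} relative)")
# size reduction: 5.98x
# accuracy drop:  0.0033 (0.44% relative)
# ppl increase:   0.0380 (1.45% relative)
```

In other words, the INT4 model is about six times smaller than the FP32 model while losing less than half a percent of relative accuracy on lambada_openai.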
