Llama-2-70b-chat-hf-onnx-int4

Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This repository contains the INT4 weight-only quantized version of the 70B fine-tuned model in ONNX format.

Note: Use of this model is governed by the Meta license. Please ensure you have accepted that license and obtained access to the FP32 model before downloading the models here.

This INT4 model is generated with Intel® Neural Compressor's weight-only quantization method.

| Model Detail | Description |
| --- | --- |
| Model Authors - Company | Intel |
| Date | August 29, 2023 |
| Version | 1 |
| Type | Text Generation |
| Paper or Other Resources | - |
| License | https://ai.meta.com/resources/models-and-libraries/llama-downloads/ |
| Questions or Comments | Community Tab |

| Intended Use | Description |
| --- | --- |
| Primary intended uses | You can use the raw model for text generation inference |
| Primary intended users | Anyone doing text generation inference |
| Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |

Export to ONNX Model

The FP32 model is exported from meta-llama/Llama-2-70b-chat-hf with optimum-cli:

optimum-cli export onnx --model meta-llama/Llama-2-70b-chat-hf --task text-generation ./llama2_70b_chat

Build ONNX Runtime

Build ONNX Runtime from source to support the MatMulWithQuantWeight op. Refer to build-onnx-runtime-for-inferencing for the prerequisites.

git clone -b sub_byte_quant_zp https://github.com/microsoft/onnxruntime.git
cd onnxruntime
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync --skip_tests --build_wheel

Run Quantization

The weight-only quantization configuration is as follows:

| dtype | group_size | scheme | algorithm |
| --- | --- | --- | --- |
| INT4 | 32 | asym | RTN |
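
To illustrate what these settings mean, here is a minimal pure-Python sketch of group-wise, asymmetric round-to-nearest (RTN) quantization. It is not the Intel® Neural Compressor implementation. The function names `quantize_rtn` and `dequantize_rtn` are hypothetical, and two choices are assumptions made only for this sketch: each group's range is extended to include zero, and each group stores one float scale and one integer zero point.

```python
def quantize_rtn(weights, group_size=32, bits=4):
    """Quantize a flat list of floats group by group (asymmetric RTN sketch)."""
    qmax = (1 << bits) - 1  # 15 for 4-bit unsigned codes
    codes, scales, zero_points = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Assumption: extend the range so that 0.0 is always representable.
        lo = min(min(group), 0.0)
        hi = max(max(group), 0.0)
        scale = (hi - lo) / qmax or 1.0  # an all-zero group falls back to scale 1.0
        zero_point = min(qmax, max(0, round(-lo / scale)))
        # Round to nearest, shift by the zero point, clamp to [0, qmax].
        codes.extend(
            min(qmax, max(0, round(w / scale) + zero_point)) for w in group
        )
        scales.append(scale)
        zero_points.append(zero_point)
    return codes, scales, zero_points


def dequantize_rtn(codes, scales, zero_points, group_size=32):
    """Map integer codes back to approximate float weights."""
    return [
        (c - zero_points[i // group_size]) * scales[i // group_size]
        for i, c in enumerate(codes)
    ]


if __name__ == "__main__":
    # Toy weights: 64 values, i.e. two groups of 32.
    toy = [((i * 37) % 64 - 32) / 100.0 for i in range(64)]
    codes, scales, zero_points = quantize_rtn(toy, group_size=32, bits=4)
    restored = dequantize_rtn(codes, scales, zero_points, group_size=32)
    worst = max(abs(a - b) for a, b in zip(toy, restored))
    print("scales:", scales)
    print("zero points:", zero_points)
    print("largest round-trip error:", worst)
```

A smaller group size gives each scale fewer weights to cover, which tends to reduce rounding error at the cost of storing more scales and zero points.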

Run INT4 weight-only quantization with Intel® Neural Compressor. The key code is shown below. For the complete quantization script, please refer to the llama weight-only example.

from neural_compressor import quantization, PostTrainingQuantConfig

config = PostTrainingQuantConfig(
    approach="weight_only",
    calibration_sampling_size=[8],
    op_type_dict={".*": {"weight": {"bits": 4, 
                                    "algorithm": ["RTN"], 
                                    "scheme": ["asym"], 
                                    "group_size": 32}}},)

# `dataloader` is the calibration dataloader; it is built in the complete script and is not shown here.
q_model = quantization.fit(
    "/path/to/llama2_70b_chat/decoder_model.onnx",  # FP32 model path
    config,
    calib_dataloader=dataloader)
q_model.save("/path/to/Llama-2-70b-chat-hf-onnx-int4/decoder_model.onnx")  # INT4 model path

Evaluation

Operator Statistics

The operator statistics of the INT4 ONNX model are shown below:

| Op Type | Total | INT4 weight | FP32 |
| --- | --- | --- | --- |
| MatMul | 641 | 561 | 80 |

Evaluation of accuracy and perplexity

Evaluate the model with the evaluation API of Intel® Extension for Transformers on the lambada_openai task.

from intel_extension_for_transformers.evaluation.lm_eval import evaluate

model_path = "/path/to/Llama-2-70b-chat-hf-onnx-int4"
tokenizer = "Intel/Llama-2-70b-chat-hf-onnx-int4"
batch_size = 64
tasks=["lambada_openai"]

results = evaluate(
    model="hf-causal",
    model_args="pretrained=" + model_path + ",tokenizer="+ tokenizer,
    batch_size=batch_size,
    tasks=tasks,
    model_format="onnx"
)

| Model | Model Size (GB) | lambada_openai acc | lambada_openai ppl |
| --- | --- | --- | --- |
| FP32 | 257 | 0.7543 | 2.6181 |
| INT4 | 43 | 0.7510 | 2.6561 |
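
The trade-off reported in the Evaluation section can be summarized with a few lines of arithmetic. The snippet below only restates the numbers from the operator statistics and results tables; the variable names are illustrative and no new measurements are introduced.

```python
# Values copied from the operator statistics and results tables in this card.
matmul_total, matmul_int4, matmul_fp32 = 641, 561, 80
fp32 = {"size_gb": 257, "acc": 0.7543, "ppl": 2.6181}
int4 = {"size_gb": 43, "acc": 0.7510, "ppl": 2.6561}

# Share of MatMul operators whose weights were quantized to INT4.
quantized_share = matmul_int4 / matmul_total

# Size reduction and quality change of INT4 relative to FP32.
size_ratio = fp32["size_gb"] / int4["size_gb"]
acc_drop = fp32["acc"] - int4["acc"]
ppl_increase = int4["ppl"] - fp32["ppl"]

print(f"MatMul ops with INT4 weights: {quantized_share:.1%}")
print(f"Size reduction: {size_ratio:.2f}x")
print(f"Accuracy drop: {acc_drop:.4f} ({acc_drop / fp32['acc']:.2%} relative)")
print(f"Perplexity increase: {ppl_increase:.4f} ({ppl_increase / fp32['ppl']:.2%} relative)")
```

In short, the INT4 model is roughly six times smaller than the FP32 model, with a small change in lambada_openai accuracy and perplexity.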