Pickle Scanning
Pickle is a widely used serialization format in ML. Most notably, it is the default format for PyTorch model weights.
There are dangerous arbitrary code execution attacks that can be perpetrated when you load a pickle file. We suggest loading models from users and organizations you trust, relying on signed commits, and/or loading models from TF or Jax formats with the from_tf=True
auto-conversion mechanism. We also alleviate this issue by displaying/βvettingβ the list of imports in any pickled file, directly on the Hub. Finally, we are experimenting with a new, simple serialization format for weights called safetensors
.
What is a pickle?
From the official docs :
The
pickle
module implements binary protocols for serializing and de-serializing a Python object structure.
What this means is that pickle is a serializing protocol, something you use to efficiently share data amongst parties.
We call a pickle the binary file that was generated while pickling.
At its core, the pickle is basically a stack of instructions or opcodes. As you probably have guessed, itβs not human readable. The opcodes are generated when pickling and read sequentially at unpickling. Based on the opcode, a given action is executed.
Hereβs a small example:
import pickle
import pickletools
var = "data I want to share with a friend"
# store the pickle data in a file named 'payload.pkl'
with open('payload.pkl', 'wb') as f:
pickle.dump(var, f)
# disassemble the pickle
# and print the instructions to the command line
with open('payload.pkl', 'rb') as f:
pickletools.dis(f)
When you run this, it will create a pickle file and print the following instructions in your terminal:
0: \x80 PROTO 4
2: \x95 FRAME 48
11: \x8c SHORT_BINUNICODE 'data I want to share with a friend'
57: \x94 MEMOIZE (as 0)
58: . STOP
highest protocol among opcodes = 4
Donβt worry too much about the instructions for now, just know that the pickletools module is very useful for analyzing pickles. It allows you to read the instructions in the file without executing any code.
Pickle is not simply a serialization protocol, it allows more flexibility by giving the ability to users to run python code at de-serialization time. Doesnβt sound good, does it?
Why is it dangerous?
As weβve stated above, de-serializing pickle means that code can be executed. But this comes with certain limitations: you can only reference functions and classes from the top level module; you cannot embed them in the pickle file itself.
Back to the drawing board:
import pickle
import pickletools
class Data:
def __init__(self, important_stuff: str):
self.important_stuff = important_stuff
d = Data("42")
with open('payload.pkl', 'wb') as f:
pickle.dump(d, f)
When we run this script we get the payload.pkl
again. When we check the fileβs contents:
# cat payload.pkl
__main__Data)}important_stuff42sb.%
# hexyl payload.pkl
ββββββββββ¬ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββ¬βββββββββ¬βββββββββ
β00000000β 80 04 95 33 00 00 00 00 β 00 00 00 8c 08 5f 5f 6d βΓβ’Γ30000β000Γβ’__mβ
β00000010β 61 69 6e 5f 5f 94 8c 04 β 44 61 74 61 94 93 94 29 βain__ΓΓβ’βDataΓΓΓ)β
β00000020β 81 94 7d 94 8c 0f 69 6d β 70 6f 72 74 61 6e 74 5f βΓΓ}ΓΓβ’imβportant_β
β00000030β 73 74 75 66 66 94 8c 02 β 34 32 94 73 62 2e βstuffΓΓβ’β42Γsb. β
ββββββββββ΄ββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββ΄βββββββββ΄βββββββββ
We can see that there isnβt much in there, a few opcodes and the associated data. You might be thinking, so whatβs the problem with pickle?
Letβs try something else:
from fickling.pickle import Pickled
import pickle
# Create a malicious pickle
data = "my friend needs to know this"
pickle_bin = pickle.dumps(data)
p = Pickled.load(pickle_bin)
p.insert_python_exec('print("you\'ve been pwned !")')
with open('payload.pkl', 'wb') as f:
p.dump(f)
# innocently unpickle and get your friend's data
with open('payload.pkl', 'rb') as f:
data = pickle.load(f)
print(data)
Here weβre using the fickling library for simplicity. It allows us to add pickle instructions to execute code contained in a string via the exec
function. This is how you circumvent the fact that you cannot define functions or classes in your pickles: you run exec on python code saved as a string.
When you run this, it creates a payload.pkl
and prints the following:
you've been pwned !
my friend needs to know this
If we check the contents of the pickle file, we get:
# cat payload.pkl
c__builtin__
exec
(Vprint("you've been pwned !")
tR my friend needs to know this.%
# hexyl payload.pkl
ββββββββββ¬ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββ¬βββββββββ¬βββββββββ
β00000000β 63 5f 5f 62 75 69 6c 74 β 69 6e 5f 5f 0a 65 78 65 βc__builtβin___exeβ
β00000010β 63 0a 28 56 70 72 69 6e β 74 28 22 79 6f 75 27 76 βc_(Vprinβt("you'vβ
β00000020β 65 20 62 65 65 6e 20 70 β 77 6e 65 64 20 21 22 29 βe been pβwned !")β
β00000030β 0a 74 52 80 04 95 20 00 β 00 00 00 00 00 00 8c 1c β_tRΓβ’Γ 0β000000Γβ’β
β00000040β 6d 79 20 66 72 69 65 6e β 64 20 6e 65 65 64 73 20 βmy frienβd needs β
β00000050β 74 6f 20 6b 6e 6f 77 20 β 74 68 69 73 94 2e βto know βthisΓ. β
ββββββββββ΄ββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββ΄βββββββββ΄βββββββββ
Basically, this is whatβs happening when you unpickle:
# ...
opcodes_stack = [exec_func, "malicious argument", "REDUCE"]
opcode = stack.pop()
if opcode == "REDUCE":
arg = opcodes_stack.pop()
callable = opcodes_stack.pop()
opcodes_stack.append(callable(arg))
# ...
The instructions that pose a threat are STACK_GLOBAL
, GLOBAL
and REDUCE
.
REDUCE
is what tells the unpickler to execute the function with the provided arguments and *GLOBAL
instructions are telling the unpickler to import
stuff.
To sum up, pickle is dangerous because:
- when importing a python module, arbitrary code can be executed
- you can import builtin functions like
eval
orexec
, which can be used to execute arbitrary code - when instantiating an object, the constructor may be called
This is why it is stated in most docs using pickle, do not unpickle data from untrusted sources.
Mitigation Strategies
Donβt use pickle
Sound advice Luc, but pickle is used profusely and isnβt going anywhere soon: finding a new format everyone is happy with and initiating the change will take some time.
So what can we do for now?
Load files from users and organizations you trust
On the Hub, you have the ability to sign your commits with a GPG key. This does not guarantee that your file is safe, but it does guarantee the origin of the file.
If you know and trust user A and the commit that includes the file on the Hub is signed by user Aβs GPG key, itβs pretty safe to assume that you can trust the file.
Load model weights from TF or Flax
TensorFlow and Flax checkpoints are not affected, and can be loaded within PyTorch architectures using the from_tf
and from_flax
kwargs for the from_pretrained
method to circumvent this issue.
E.g.:
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-cased", from_flax=True)
Use your own serialization format
This last format, safetensors
, is a simple serialization format that we are working on and experimenting with currently! Please help or contribute if you can π₯.
Improve torch.load/save
Thereβs an open discussion in progress at PyTorch on having a Safe way of loading only weights from *.pt file by default β please chime in there!
Hubβs Security Scanner
What we have now
We have created a security scanner that scans every file pushed to the Hub and runs security checks. At the time of writing, it runs two types of scans:
- ClamAV scans
- Pickle Import scans
For ClamAV scans, files are run through the open-source antivirus ClamAV. While this covers a good amount of dangerous files, it doesnβt cover pickle exploits.
We have implemented a Pickle Import scan, which extracts the list of imports referenced in a pickle file. Every time you upload a pytorch_model.bin
or any other pickled file, this scan is run.
On the hub the list of imports will be displayed next to each file containing imports. If any import looks suspicious, it will be highlighted.
We get this data thanks to pickletools.genops
which allows us to read the file without executing potentially dangerous code.
Note that this is what allows to know if, when unpickling a file, it will REDUCE
on a potentially dangerous function that was imported by *GLOBAL
.
Disclaimer: this is not 100% foolproof. It is your responsibility as a user to check if something is safe or not. We are not actively auditing python packages for safety, the safe/unsafe imports lists we have are maintained in a best-effort manner. Please contact us if you think something is not safe, and we flag it as such, by sending us an email to website at huggingface.co
Potential solutions
One could think of creating a custom Unpickler in the likes of this one. But as we can see in this sophisticated exploit, this wonβt work.
Thankfully, there is always a trace of the eval
import, so reading the opcodes directly should allow to catch malicious usage.
The current solution I propose is creating a file resembling a .gitignore
but for imports.
This file would be a whitelist of imports that would make a pytorch_model.bin
file flagged as dangerous if there are imports not included in the whitelist.
One could imagine having a regex-ish format where you could allow all numpy submodules for instance via a simple line like: numpy.*
.
Further Reading
pickle - Python object serialization - Python 3.10.6 documentation
Dangerous Pickles - Malicious Python Serialization
GitHub - trailofbits/fickling: A Python pickling decompiler and static analyzer
cpython/pickletools.py at 3.10 Β· python/cpython
cpython/pickle.py at 3.10 Β· python/cpython
CrypTen/serial.py at main Β· facebookresearch/CrypTen