Learn Python Series (#28) - Using Pickle and Shelve

Repository

https://github.com/python/cpython

What will I learn?

  • In this episode of the Learn Python Series you will learn about two additional ways to serialize and de-serialize Python objects for persistent storage: pickle and shelve;
  • you will learn when (not) to use pickle over JSON, and when (not) to use shelve over a "real" database environment;
  • we'll also discuss some dangers to be aware of when using pickle.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.6) distribution, such as (for example) the Anaconda Distribution;
  • The ambition to learn Python programming.

Difficulty

  • Beginner

Additional sample code files

The full - and working! - iPython tutorial sample code file is included for you to download and run for yourself right here:
https://github.com/realScipio/learn-python-series/blob/master/pickle-tut01.ipynb

GitHub Account

https://github.com/realScipio

Welcome to episode #28 of the Learn Python Series! In episode #15 we focused our attention on the JSON file format and on reading from and writing to .json files. JSON is language- and platform-independent, and human-readable as well. It can be used to serialize Python objects to JSON data and to de-serialize them again, and because .json files can be stored on disk, they can also be shared among processes and computer systems. However, great as JSON might be for serializing / de-serializing Python objects, it does have its limitations, because not every Python object can be "JSON-i-fied": JSON doesn't differentiate between lists and tuples (a tuple comes back as a list), object keys are required to be strings, and datetime objects don't work "out-of-the-box" (they require custom (de)serialization). Also, there are situations where "human-readable" could be considered a security risk, so JSON isn't the preferred choice in every situation.
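To make these JSON limitations concrete, here is a minimal sketch using only the standard json module. The sample values (point, scores and the date) are made up for illustration:

```python
import json
from datetime import datetime

# A tuple is written as a JSON array, so it comes back as a list.
point = (1, 2)
restored_point = json.loads(json.dumps(point))
print(restored_point)

# Non-string dictionary keys are converted to strings on the way out.
scores = {1: "one", 2: "two"}
restored_scores = json.loads(json.dumps(scores))
print(restored_scores)

# datetime objects are rejected unless you supply a custom encoder.
try:
    json.dumps({"created": datetime(2018, 7, 1)})
    datetime_ok = True
except TypeError as err:
    datetime_ok = False
    print(f"Not serializable: {err}")
```

Note that none of these cases silently corrupts data, but in the first two the object you read back is no longer equal to the one you wrote.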

Pickle vs JSON

Pickle, a module in the Python standard library (so it's included with every Python distribution, Anaconda among them), can also be used for serializing and de-serializing Python objects. "Pickling" converts almost any Python object out-of-the-box (apart from a few edge cases, like generators and lambda functions, which we haven't discussed yet in earlier episodes) into a byte stream that can be saved to disk. That byte stream contains all the information that's needed to rebuild the object, by the same or by another Python program.

You could pickle the following object types:

  • strings, bytes and bytearrays
  • integers, floats, complex numbers
  • lists, dictionaries, tuples, sets
  • None, True and False
  • functions (built-in and user-defined) and classes defined at a module's top level
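As a quick check of that list, the sketch below round-trips a few of the types that JSON struggles with, and then shows one of the edge cases (a lambda) failing. It uses pickle.dumps() and pickle.loads(), the in-memory counterparts of the file-based functions; the sample dictionary is made up for illustration:

```python
import pickle
from datetime import datetime

# Objects that JSON struggles with survive a pickle round trip unchanged.
original = {
    "point": (1, 2),                  # stays a tuple
    "tags": {"a", "b"},               # sets are supported
    1: "integer key",                 # keys keep their type
    "created": datetime(2018, 7, 1),  # no custom encoder needed
}
restored = pickle.loads(pickle.dumps(original))
print(restored == original)

# A lambda has no importable name, so pickling it fails.
# The exact exception type differs between Python versions.
try:
    pickle.dumps(lambda x: x * 2)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as err:
    lambda_pickled = False
    print(f"Cannot pickle a lambda: {err}")
```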

Nota bene:
As opposed to JSON however, pickle is not language-independent (the format is Python-specific, and the available protocol versions differ between Python releases), it can be slow depending on the protocol and the objects involved, it uses a binary format (ergo not human-readable), and it is a security risk, because arbitrary code contained in a pickle can be executed while de-serializing. So while this last sentence might not sound like a great sales pitch for using pickle, if you don't have language interoperability requirements for exchanging serialized objects, if you don't have to deal with untrusted data sources, and if a binary format is OK or even preferred, then pickle works great!
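The remark about Python versions refers to pickle's numbered protocols. A minimal sketch of how you might inspect and pick a protocol, assuming you need a file that an older Python release can still read (protocol 2 is only an example choice here):

```python
import pickle

data = ["Steem", "Bitcoin", "Litecoin"]

# The result of pickling is a bytes object, not text.
default_bytes = pickle.dumps(data)
print(type(default_bytes))

# The default and highest protocol depend on the Python version in use.
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

# An older protocol can be requested explicitly.
old_bytes = pickle.dumps(data, protocol=2)

# When loading, the protocol is detected automatically.
same = pickle.loads(old_bytes) == pickle.loads(default_bytes) == data
print(same)
```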

Let's see how Pickle works!

Working with pickle

In order to work with pickle, first import it:

import pickle

Serializing (pickling, dumping)

Now, say, we want to pickle a list of our favorite cryptos, like these:

fav_cryptos = [
    "Steem",
    "Steem Backed Dollars",
    "Bitcoin",
    "IoTeX",
    "Litecoin",
    "Stellar",
    "Byteball",
    "Tether"
]

Like json, pickle has two main functions:

  • dump, to serialize and "dump" a Python object to file, and
  • load, to de-serialize a pickled file object.
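Also like json, pickle offers in-memory variants of both functions, dumps() and loads(), which work with a bytes object instead of a file. A short sketch, using a shortened list for brevity:

```python
import pickle

fav_cryptos = ["Steem", "Bitcoin", "Litecoin"]

# dumps() returns the pickled bytes instead of writing them to a file.
pickled_bytes = pickle.dumps(fav_cryptos)

# loads() rebuilds the object from those bytes.
restored = pickle.loads(pickled_bytes)
print(restored)
```

This is handy when the pickle has to travel somewhere other than a file, for example over a network connection or into a database field.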

For writing the fav_cryptos list to file via pickle, we need to first specify the filename:

filename = "fav_cryptos.p"

Next we define the file object: we open the file for writing via the open() function, to which we pass two arguments: the filename, and 'wb' for writing in binary mode.

fileobject = open(filename, 'wb')

Now that the file is opened for writing to, use pickle.dump() and pass in the object you want to pickle (in this case our fav_cryptos list) as its first argument, and the fileobject as its second argument:

pickle.dump(fav_cryptos, fileobject)

Then close the file object.

fileobject.close()

At this point, the object fav_cryptos is saved on disk as the pickled file fav_cryptos.p!

You could also use the with statement, which closes the file for you automatically, even if an error occurs (for this example I'll use another list and pickle dump file):

import pickle
colors = [
    'Green', 
    'Yellow', 
    'Orange', 
    'Red',
    'Blue',
    'Brown',
    'White',
    'Black'
]
with open('colors.p', 'wb') as f:
    pickle.dump(colors, f)

De-serializing (unpickling, loading)

Unpickling a pickled file is quite similar: open() the file again but now use the rb flag (for reading in binary mode), and use pickle.load() to assign it to a new variable:

import pickle
fileobject = open('fav_cryptos.p', 'rb')
unpickled_cryptos = pickle.load(fileobject)
fileobject.close()

print(type(unpickled_cryptos), unpickled_cryptos)
<class 'list'> ['Steem', 'Steem Backed Dollars', 'Bitcoin', 'IoTeX', 'Litecoin', 'Stellar', 'Byteball', 'Tether']

Or, again use the shorthand notation using with:

import pickle
with open('colors.p', 'rb') as f:
    unpickled_colors = pickle.load(f)
    
print(type(unpickled_colors), unpickled_colors)
<class 'list'> ['Green', 'Yellow', 'Orange', 'Red', 'Blue', 'Brown', 'White', 'Black']

As you can see, while printing the unpickled objects I also checked their types, and both are correctly typed as a list. You could go a step further and compare the original objects to the unpickled ones to see if they are the same:

print(fav_cryptos == unpickled_cryptos)
print(colors == unpickled_colors)
True
True
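One nuance worth knowing: "the same" here means equal in value, not the very same object in memory. A small sketch of the difference (the names original and clone are made up for this example):

```python
import pickle

original = ["Green", "Yellow", "Orange"]
clone = pickle.loads(pickle.dumps(original))

print(clone == original)  # equal contents
print(clone is original)  # but a separate object

# Changing the clone leaves the original untouched.
clone.append("Red")
print(original)
```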

Again, a word of caution using pickle with untrusted data sources

As explained, a pickle can contain instructions that call arbitrary functions while it is being unpickled, so loading a malicious pickle can execute arbitrary code. As a rule of thumb, simply never unpickle data from unknown or untrusted sources. Yet if you must for some reason, make sure to use an encrypted network connection, and/or cryptographically sign and verify the pickle, and/or restrict file system permissions.
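As an illustration of the "sign and verify" option, here is a minimal sketch using the standard hmac and hashlib modules. The names SECRET_KEY, sign_pickle and load_signed_pickle are hypothetical helpers invented for this example, and the approach assumes that sender and receiver share a secret key that an attacker doesn't have:

```python
import hashlib
import hmac
import pickle

# Hypothetical shared secret; in practice, load it from a secure location.
SECRET_KEY = b"replace-with-a-real-secret"

def sign_pickle(obj):
    """Pickle obj and prepend an HMAC-SHA256 signature of the payload."""
    payload = pickle.dumps(obj)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return signature + payload

def load_signed_pickle(blob):
    """Verify the signature first; unpickle only if it matches."""
    signature, payload = blob[:32], blob[32:]  # SHA-256 digests are 32 bytes
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("Signature mismatch: refusing to unpickle")
    return pickle.loads(payload)

blob = sign_pickle(["Steem", "Bitcoin"])
print(load_signed_pickle(blob))

# Flipping a single bit of the payload is detected before unpickling.
tampered = blob[:-1] + bytes([blob[-1] ^ 1])
try:
    load_signed_pickle(tampered)
    tamper_detected = False
except ValueError:
    tamper_detected = True
print(tamper_detected)
```

A signature only proves that the pickle came from someone holding the key; it does not make the pickle's contents any safer if that key leaks.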

Working with shelve

shelve is built on top of pickle, and it acts somewhat like a database. In fact, you can use shelve as a persistent Python object store when you don't want to or can't use a "real" database. Shelved objects are also pickled, but via shelve the objects are associated with a string key. This means you can access your pickled objects via their key, just like you would with a Python dictionary! shelve is pretty convenient when serializing many objects.

In order to work with shelve first import it:

import shelve

Serializing (shelving, dumping)

The shelve syntax is pretty similar to pickle. Let's shelve the objects that we unpickled before!

with shelve.open('test_shelf') as shelf:
    shelf['cryptos'] = unpickled_cryptos
    shelf['colors'] = unpickled_colors

At this point, a shelved database file is stored (test_shelf.db on macOS, but on other systems, depending on the specific DBM implementation that is used, you might get output files with no extension, or with the extensions .bak, .dat, .dir, or .pag.)
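If you're curious which files and which DBM implementation you get on your own system, something like the following sketch may help. It writes a throw-away shelf (demo_shelf, a made-up name) into a temporary directory, so nothing is left behind:

```python
import dbm
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    shelf_path = os.path.join(tmpdir, "demo_shelf")

    with shelve.open(shelf_path) as shelf:
        shelf["colors"] = ["Green", "Yellow"]

    # The file name(s) depend on the dbm backend available on your system.
    created_files = sorted(os.listdir(tmpdir))
    print(created_files)

    # whichdb() guesses which dbm module created the database.
    backend = dbm.whichdb(shelf_path)
    print(backend)
```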

De-serializing (unshelving, loading)

In order to access the shelved data, just open the shelf via shelve.open() and use it like you would with a "normal" Python dictionary:

with shelve.open('test_shelf') as shelf:
    shelved_colors = shelf['colors']

print(shelved_colors)
['Green', 'Yellow', 'Orange', 'Red', 'Blue', 'Brown', 'White', 'Black']

If you don't know the keys that exist within the shelf, you can of course list them, like so:

with shelve.open('test_shelf') as shelf:
    print(list(shelf.keys()))
['cryptos', 'colors']

The stored values can be listed via the values() method, although this can be quite slow, because every value has to be unpickled:

with shelve.open('test_shelf') as shelf:
    print(list(shelf.values()))
[['Steem', 'Steem Backed Dollars', 'Bitcoin', 'IoTeX', 'Litecoin', 'Stellar', 'Byteball', 'Tether'], ['Green', 'Yellow', 'Orange', 'Red', 'Blue', 'Brown', 'White', 'Black']]

Since shelves behave like dictionaries, if you want to iterate over all shelved items, you can:

with shelve.open('test_shelf') as shelf:
    for key in shelf:
        print(key, shelf[key])
cryptos ['Steem', 'Steem Backed Dollars', 'Bitcoin', 'IoTeX', 'Litecoin', 'Stellar', 'Byteball', 'Tether']
colors ['Green', 'Yellow', 'Orange', 'Red', 'Blue', 'Brown', 'White', 'Black']
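A few more dictionary-style operations work on a shelf as well. The sketch below uses a throw-away shelf in a temporary directory (demo_shelf and the "animals" key are made up for this example):

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    with shelve.open(os.path.join(tmpdir, "demo_shelf")) as shelf:
        shelf["colors"] = ["Green", "Yellow"]
        shelf["cryptos"] = ["Steem", "Bitcoin"]

        has_colors = "colors" in shelf      # membership test
        missing = shelf.get("animals", [])  # default for a missing key
        count = len(shelf)                  # number of stored keys
        pairs = sorted(shelf.items())       # (key, value) tuples

print(has_colors, missing, count)
print(pairs)
```

The items are sorted here because the order in which a shelf returns its keys depends on the underlying DBM implementation.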

Updating / modifying shelves

By default, a shelf doesn't track any updates / modifications made to a de-serialized object.
So if you try the following (to add Monero to the shelved list of cryptos), the shelf itself isn't persistently updated:

with shelve.open('test_shelf') as shelf:
    shelf['cryptos'].append('Monero')

with shelve.open('test_shelf') as shelf:
    shelved_cryptos = shelf['cryptos']

print(shelved_cryptos)
['Steem', 'Steem Backed Dollars', 'Bitcoin', 'IoTeX', 'Litecoin', 'Stellar', 'Byteball', 'Tether']

As you can see, after re-loading the shelf Monero is not contained in the shelved cryptos list.

You can of course update a shelved item persistently, in one of two ways:

  1. Read the item from the shelf (which gives you a de-serialized copy), append the new element (Monero) to that copy, and then store the entire copy back in the shelf under its key:
with shelve.open('test_shelf') as shelf:
    cryptos = shelf['cryptos']
    cryptos.append('Monero')
    shelf['cryptos'] = cryptos

with shelve.open('test_shelf') as shelf:
    shelved_cryptos = shelf['cryptos']

print(shelved_cryptos)
['Steem', 'Steem Backed Dollars', 'Bitcoin', 'IoTeX', 'Litecoin', 'Stellar', 'Byteball', 'Tether', 'Monero']

And now Monero is added to the persistent shelf.

  2. The second way, which is less verbose but slower and more RAM-hungry, is to open the shelf with the flag writeback=True, and append the new item directly to the shelved object (let's now add Dash, since Monero is already added):
with shelve.open('test_shelf', writeback=True) as shelf:
    shelf['cryptos'].append('Dash')

with shelve.open('test_shelf') as shelf:
    shelved_cryptos = shelf['cryptos']

print(shelved_cryptos)
['Steem', 'Steem Backed Dollars', 'Bitcoin', 'IoTeX', 'Litecoin', 'Stellar', 'Byteball', 'Tether', 'Monero', 'Dash']

Removing Dash is of course as simple as:

with shelve.open('test_shelf', writeback=True) as shelf:
    shelf['cryptos'].remove('Dash')

with shelve.open('test_shelf') as shelf:
    shelved_cryptos = shelf['cryptos']

print(shelved_cryptos)
['Steem', 'Steem Backed Dollars', 'Bitcoin', 'IoTeX', 'Litecoin', 'Stellar', 'Byteball', 'Tether', 'Monero']

Deleting elements from a shelf

If you want to completely remove a shelf element, for example all cryptos stored in shelf['cryptos'], then use the del keyword:

with shelve.open('test_shelf') as shelf:
    del shelf['cryptos']

with shelve.open('test_shelf') as shelf:
    print(list(shelf.keys()))
['colors']

Shelf concurrency and Read-Only using flag='r'

Please note that the underlying DBM module powering shelf databases doesn't support concurrent writing, for example when multiple applications have the same shelve database open and try to write to it at the same time.

Concurrent reads are safe, however, so a smart thing to do if you want to use concurrency is to let a client that only needs to read from the shelf do so in read-only mode, by passing flag='r' while opening the shelf.

For demonstration purposes, I'll now explicitly import the dbm module as well, so that we can catch its error and print a message when the read-only shelf does try to write:

import dbm

with shelve.open('test_shelf', flag='r') as shelf:
    try:
        colors = shelf['colors']
        colors.append('Pink')
        shelf['colors'] = colors
        print(shelf['colors'])
    except dbm.error as err:
        print(f"Woops! There's an error: {err}")
Woops! There's an error: cannot add item to database

What did we learn, hopefully?

Using pickle and shelve, although they should be treated with care (think of the security issues when unpickling untrusted data), is both pretty powerful and convenient, to me at least. If you scroll back to my earlier episode #14, on developing a mini-Steem crawler for account discovery, just imagine how much easier it would have been to simply use a shelf and update the todo, done and all files!

I hope you enjoyed this tutorial as much as I have writing it!

Thank you for your time!

Woah, I loved how expository your code was. I am also a python programmer but am not too familiar with this module. Thanks for the tutorial. A question though. Why is the dump function there when a suitable alternative would be to commit the dictionary to a file simply using such code for example

x = open(filename, 'w+')
x.write(fav_cryptos)
x.close()

Wouldn't this append the file to the document or is it for a more peculiar purpose?

Hi, and thx! ;-)

I've discussed two modules, pickle and shelve (they're two different modules, where shelve is built on top of pickle). A plain x.write(fav_cryptos) wouldn't work here, since write() only accepts strings, not a list or a dictionary. The key take-away of this episode is that dumping via pickle is a way to "serialize" a Python object (for example a dictionary or a list). Such an object lives in RAM and only exists "inside the program"; pickling saves it to a file in binary mode, and the pickle contains the full instructions needed to un-pickle it, ergo to read it back into RAM, by the same or another program, at a later time, even after the original program was closed.

Simple use case? When you're playing a game and you're saving your progress on level 14 before you go to sleep. Saving the "game state" == pickling ;-)

Ooh wow
That's elucidated things
I fully get the application now
It's like saving a data state and all the changes with a single module. That's awesome
Thanks

Thank you for your contribution.

  • Good tutorial, thanks for your work!

[utopian-moderator]

Thx! It was pretty fun to write as well, especially this episode!

Thank you, scipio. Upvoted and resteemed!
