Data Serialization, Deserialization, and Migration using HAppS-Data

Data Migration in HAppS

HAppS applications, like any application with persistent data storage, are faced with the issue of migrating existing data when the format of the persistent data is changed. This tutorial will explore the binary serialization and migration facilities provided by HAppS-Data. If you think I am doing it all wrong, please let me know. Writing this tutorial is the extent of my experience using the HApps-Data migration facilities.

Requirements

This tutorial only uses the HAppS-Data (and dependencies) portion of HAppS. It has been tested with HAppS-Data 0.9.3. The first three lines of the module look like this:

> {-# LANGUAGE TemplateHaskell, UndecidableInstances, FlexibleInstances, GeneralizedNewtypeDeriving, MultiParamTypeClasses, DeriveDataTypeable, TypeFamilies #-} > module Main where > import HAppS.Data

Serialization

The most obvious way to serialize data in Haskell is to use the familiar Read and Show classes. Simply use show to turn a value into a String, and read to turn a String back into a value. This method has three serious flaws however:

The law read . show == id does not hold for all Show/Read instances.
The serialized representation is rather verbose
No migration path when types change, leaving your old data inaccessible

HAppS-Data provides a Serialize class which addresses these three issues. From an end user point of view the Serialize functionality provides three items of interest:

The Serialize class
the serialize and deserialize functions
The deriveSerialize function

#ifdef HsColour > class (Typeable a, Version a) => Serialize a where > ... > > serialize :: Serialize a => a -> Lazy.ByteString > deserialize :: Serialize a => Lazy.ByteString -> (a, Lazy.ByteString) > > deriveSerialize :: Language.Haskell.TH.Syntax.Name > -> Language.Haskell.TH.Syntax.Q [Language.Haskell.TH.Syntax.Dec] #endif

The Version superclass is used during data migration. The serialize and deserialize functions are the counterparts to show and read. deriveSerialize is a Template Haskell function which provides functionality similar to deriving (Read, Show).

The `Version` class

The Version class is very straight-forward. It consists of a single function which returns the Mode (aka, the version) of a datatype.

#ifdef HsColour > > class Version a where > mode :: Mode a > mode = Versioned 0 Nothing > > data Mode a = Primitive -- ^ Data layout won't change. Used for types like Int and Char. > | Versioned (VersionId a) (Maybe (Previous a)) > > newtype VersionId a = VersionId {unVersion :: Int} deriving (Num,Read,Show,Eq) #endif

There are two categories of datatypes:

primitives whose layout will never change, and, hence, will never need to be migrated
everything else

The Versioned constructor takes two arguments. The first argument is a version number which you increment when you make an change to the data-type. The second argument is an indicator of the previous version of the data-type. The exact details are covered in the next section.

Putting it all together

Let's say we have the following types:

> > $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default] > [d| > data Foo > = Bar String > | Baz Beep > > data Beep > = Beep > |])

The deriveAll template haskell function is similar to the normal haskell deriving clause, except it also has the ability to derive Default instances. Additionally, it always derives Typeable and Data instances even though they are not explicitly listed.

To make the types serializeable we first need to create Version instances.

> instance Version Beep where > mode = Versioned 0 Nothing > > instance Version Foo where > mode = Versioned 0 Nothing

We want to indicate that Beep and Foo are non-primative types, so we use the Versioned constructor. Next we specify a version number for the type. It could be anything, but 0 is the most sensible choice. Since there are now previous versions of these types we mark the previous type as Nothing.

For all non-primitive types the initial version of Versioned 0 Nothing is sensible. So the Version class provides it as a default value for mode:

#ifdef HsColour > class Version a where > mode :: Mode a > mode = Versioned 0 Nothing #endif

Hence, we could shorten our Version instances from above to:

#ifdef HsColour > instance Version Beep > instance Version Foo #endif

Next we derive Serialize instances for our types:

> $(deriveSerialize ''Beep) > $(deriveSerialize ''Foo)

Now we can use serialize to serialize values. Let's look at the output of serialize Beep

*Main> Data.ByteString.Lazy.unpack $ serialize Beep
[0,0,0,0,0,0,0,0,0]
*Main>

We see that Beep serializes to 9 bytes. The first 8 bytes represent the VersionId. VersionId is basically an Int, and the serialization code always treats Ints as a 64-bit values to avoid cross-platform issues. The final byte indicates which constructor of Beep was used. In this case the zeroth constructor was used.

At first it may seem like we don't have enough information here to deserialize the data, after all there are no type names, constructors, etc. But deserializing these bytes is no different than doing read "1" :: Int. Because we know the type of the value we want to be reading at compile time, we do not need to record that information in the stored data. We just do:

*Main> deserialize (serialize Beep) :: (Beep,ByteString)
(Beep,Empty)
*Main>

As a side note, Strings are serialized to a very compact representation. In fact, they are stored as compactly as ByteStrings because they are first converted to a ByteString.

*Main> Data.ByteString.Lazy.unpack $ serialize "hello"
[0,0,0,0,0,0,0,5,104,101,108,108,111]
*Main> Data.ByteString.Lazy.unpack $ serialize (Data.ByteString.Lazy.Char8.pack "hello")
[0,0,0,0,0,0,0,5,104,101,108,108,111]
*Main>

The first 8 bytes are the length of the String, and the remaining bytes are the utf-8 encoded characters of the String.

So, if you application is best served by using Strings instead of ByteStrings, you do not have to take an extra steps to ensure that the serialized data is compactly represented.

Simple Migration

Let's say we want to add another constructor to the Beep type. As a first pass, we will actually create a whole new type named Beep', which is similar to the old type, but has an additional constructor BeepBeep.

> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default] > [d| > data Beep' = BeepBeep' | Beep' > |]) > > $(deriveSerialize ''Beep')

Because we are extending a previous type, our Version instance will look a bit different:

> instance Version Beep' where > mode = extension 1 (Proxy :: Proxy Beep)

This indicates that we are extending the old type Beep. The new version number must be higher than the old version, but does not have to be strictly sequential.

Because we specified that this type is a newer version of an older type, we also need to tell HAppS how to migrate the old data to the new type. To do this, we simply create an instance of the Migrate class.

#ifdef HsColour > class Migrate a b where > migrate :: a -> b #endif

The Migrate class is quite simple, it contains a single function, migrate which migrates something of type a to type b. In our current example, all we need is:

> instance Migrate Beep Beep' where > migrate Beep = Beep'

We can demonstrate migration by serializing a value of type Beep and deserializing it as type Beep'. The migration happens automatically in the deserialize function.

*Main> fst $ deserialize (serialize Beep) :: Beep'
Beep'
*Main>

When deserialize tries to deserialize the data produced by serialize Beep, it will first check the version number. When it sees that the version number in the stored data is lower than the version number of the current type it will instead try to decode it as the type you specified as the previous version. If the version associated with the previous type is still higher than the value in the serialized data, the migration code will recurse until it finds a matching version number. Once it finds a matching version number, it will call the corresponding deserialization "instance" to decode the old data. Then as the recursion unwinds, it will apply the migrate function to migrate the data to newer and newer formats until it reaches the newest format.

Managing History

A big issue in the above example is that when we added the new constructor we also changed the name of the type and its existing constructors. That is not very convenient in a real application where you have a multitude of references to the old names.

Fortunately, we do not have to change the name of the type to add a new constructor. As we saw in the beginning, the name of the type and the names of the constructors are not actually stored in the serialized data. So, instead we can change the name of the old type from Beep to OldBeep and update its constructor as well.

> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default] > [d| > data OldBeep = OldBeep > |]) > > $(deriveSerialize ''OldBeep) > instance Version OldBeep

Because OldBeep and Beep have the same shape, they will serialize to the same byte sequence:

*Main> Data.ByteString.Lazy.unpack $ serialize OldBeep
[0,0,0,0,0,0,0,0,0]
*Main> Data.ByteString.Lazy.unpack $ serialize Beep
[0,0,0,0,0,0,0,0,0]
*Main>

that means we can serialize an OldBeep value and then deserialize it as a Beep value, like this:

*Main> fst $ deserialize (serialize OldBeep) :: Beep
Beep
*Main>

Note that this is not the same as migration. Here we are just exploiting the fact that because the type name and constructor names are not encoded in the serialized data we can change those names and still be able to deserialize the data.

Full Migration Example #1

Here is the full example which shows:

Beep renamed to OldBeep
the new Beep with the extra constructor
the migration code from OldBeep to Beep

#ifdef HsColour > $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default] > [d| > data OldBeep > = OldBeep > |]) > > instance Version OldBeep > $(deriveSerialize ''OldBeep) > > > $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default] > [d| > data OldBeep > = OldBeep > data Beep = BeepBeep | Beep > |]) > > instance Version Beep where > mode = extension 1 (Proxy :: Proxy OldBeep) > > $(deriveSerialize ''Beep) #endif

Using separate files to manage type history

Keeping all the revisions of your type in one file, and changing the name of the type and its constructors every revision is tedious and hard to manage. Instead, we can use a system where we rename the files that contain our types. To start, we will put the types we want to serialize in a separate file (or files), such as Types.lhs.

#ifdef HsColour #include "step1/Types.lhs" #endif

Now let's say we want to add a constructor Bop to the type Foo. First we rename Types.lhs to Types_000.lhs and change the module name to reflect the changed file name:

#ifdef HsColour #include "step2/Types_000.lhs" #endif

Next we create a new Types.lhs:

#ifdef HsColour #include "step2/Types.lhs" #endif

Serializing Datatypes from 3rd Party Libraries

It is also possible to serialize datatypes from 3rd party libraries, provided those types have Data and Typeable instances. There is a caveat with this however. If the third party library changes the type, then you will not be able to read your data. This is not a fatal flaw however. You can simply copy the old type definition into a local module, and then migrate the old data to the new format.

Suggested Policy

Put the types you will serialize in one or more files which only contain types
Deploy your web 2.718 killer app
Before you do any more development, copy the current type files to sequential versions and create new current type files which re-export all the types. You can skip this step if the current type file only contains re-exports. i.e., if no type changes were made to that type file during the previous iteration.
Make changes for current development cycle, and then go to step 2.