<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" 
          "http://www.w3.org/TR/html4/strict.dtd">
<html>
 <head>
  <title>Data Serialization, Deserialization, and Migration using HAppS-Data</title>
  <link type='text/css' rel='stylesheet' href='hscolour.css'>
  <link type='text/css' rel='stylesheet' href='blog.css'>
 </head>
 <body>

<!--

> {-# OPTIONS_GHC -cpp #-}

-->

  <h2>Data Migration in HAppS</h2>
  <p>HAppS applications, like any application with persistent data storage, are faced with the issue of migrating existing data when the format of the persistent data is changed. This tutorial will explore the binary serialization and migration facilities provided by HAppS-Data. If you think I am doing it all wrong, please let me know. Writing this tutorial is the extent of my experience using the <kbd>HAppS-Data</kbd> migration facilities.</p>
<!--
  <ul>
   <li>deriveSerialize
   <li>encode/decode
   <li>Version and Migrate
  </ul>


String vs ByteString
Record
Nested types
-->
<h3>Requirements</h3>

<p>This tutorial only uses the <kbd>HAppS-Data</kbd> (and dependencies) portion of <kbd>HAppS</kbd>. It has been tested with <kbd>HAppS-Data 0.9.3</kbd>. The first three lines of the module look like this:</p>

> {-# LANGUAGE TemplateHaskell, UndecidableInstances, FlexibleInstances, GeneralizedNewtypeDeriving, MultiParamTypeClasses, DeriveDataTypeable, TypeFamilies #-}
> module Main where
> import HAppS.Data

<h3>Serialization</h3>

<p>The most obvious way to serialize data in Haskell is to use the familiar <code>Read</code> and <code>Show</code> classes. Simply use <code>show</code> to turn a value into a <code>String</code>, and <code>read</code> to turn a <code>String</code> back into a value. This method has three serious flaws, however:</p>

<ol>
 <li>The law <code>read . show == id</code> does not hold for all <code>Show</code>/<code>Read</code> instances.
 <li>The serialized representation is rather verbose.
 <li>There is no migration path when types change, which leaves your old data inaccessible.
</ol>
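
<p>The third flaw is easy to reproduce with nothing but the Prelude. The following is only a sketch: the <code>User</code> type, the stored strings, and the <code>load</code> helper are made up for this illustration and are not part of <kbd>HAppS-Data</kbd>. It assumes a value was saved with <code>show</code> back when <code>User</code> had a single field, and that a second field was added afterwards.</p>

#ifdef HsColour
> module Main where
>
> -- Hypothetical current definition: the Int field was added after data was saved.
> data User = User String Int
>   deriving (Eq, Read, Show)
>
> -- What show produced when the type was still: data User = User String
> oldStored :: String
> oldStored = "User \"alice\""
>
> -- What show produces for the current definition.
> newStored :: String
> newStored = show (User "alice" 0)
>
> -- reads returns an empty list when the text does not parse as a User.
> load :: String -> [(User, String)]
> load = reads
#endif

<pre>
*Main> load oldStored
[]
*Main> load newStored
[(User "alice" 0,"")]
*Main>
</pre>

<p>The old text is still on disk, but the derived <code>Read</code> instance for the new type has no way to interpret it.</p>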

<p><kbd>HAppS-Data</kbd> provides a <code>Serialize</code> class which addresses these three issues. From an end user point of view the <code>Serialize</code> functionality provides three items of interest:</p>
 <ol>
  <li>The <code>Serialize</code> class
  <li>The <code>serialize</code> and <code>deserialize</code> functions
  <li>The <code>deriveSerialize</code> function
 </ol>

#ifdef HsColour
> class (Typeable a, Version a) => Serialize a where 
>   ...
>
> serialize :: Serialize a => a -> Lazy.ByteString
> deserialize :: Serialize a => Lazy.ByteString -> (a, Lazy.ByteString)
>
> deriveSerialize :: Language.Haskell.TH.Syntax.Name
>                  -> Language.Haskell.TH.Syntax.Q [Language.Haskell.TH.Syntax.Dec]
#endif

<p>The <code>Version</code> superclass is used during data migration. The <code>serialize</code> and <code>deserialize</code> functions are the counterparts to <code>show</code> and <code>read</code>. <code>deriveSerialize</code> is a Template Haskell function which provides functionality similar to <code>deriving (Read, Show)</code>.</p>

<h3>The <code>Version</code> class</h3>

<p>The <code>Version</code> class is very straightforward. It consists of a single method, <code>mode</code>, which returns the <code>Mode</code> (that is, the version information) of a datatype.</p>

#ifdef HsColour
>
> class Version a where
>     mode :: Mode a
>     mode = Versioned 0 Nothing
>
> data Mode a = Primitive -- ^ Data layout won't change. Used for types like Int and Char.
>             | Versioned (VersionId a) (Maybe (Previous a))
>
> newtype VersionId a = VersionId {unVersion :: Int} deriving (Num,Read,Show,Eq)
#endif

<p>There are two categories of datatypes:</p>
 <ul>
  <li>primitives whose layout will never change, and, hence, will never need to be migrated
  <li>everything else
 </ul>

<p>The <code>Versioned</code> constructor takes two arguments. The first argument is a version number which you increment when you make a change to the datatype. The second argument is an indicator of the previous version of the datatype. The exact details are covered in the next section.</p>

<h3>Putting it all together</h3>

<p>Let's say we have the following types:</p>

>
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
>  [d|
>      data Foo 
>          = Bar String
>          | Baz Beep
>            
>      data Beep 
>          = Beep
>    |])

<p>The <code>deriveAll</code> Template Haskell function is similar to the normal Haskell <code>deriving</code> clause, except it also has the ability to derive <code>Default</code> instances. Additionally, it always derives <code>Typeable</code> and <code>Data</code> instances even though they are not explicitly listed.</p>

<p>To make the types serializable we first need to create <code>Version</code> instances.</p>

> instance Version Beep where
>     mode = Versioned 0 Nothing
>
> instance Version Foo where
>     mode = Versioned 0 Nothing

<p>We want to indicate that <code>Beep</code> and <code>Foo</code> are non-primitive types, so we use the <code>Versioned</code> constructor. Next we specify a version number for the type. It could be anything, but <code>0</code> is the most sensible choice. Since there are no previous versions of these types, we mark the previous type as <code>Nothing</code>.</p>

<p>For all non-primitive types the initial version of <code>Versioned 0 Nothing</code> is sensible. So the <code>Version</code> class provides it as a default value for <code>mode</code>:</p>

#ifdef HsColour
> class Version a where
>    mode :: Mode a
>    mode = Versioned 0 Nothing
#endif

<p>Hence, we could shorten our <code>Version</code> instances from above to:</p>

#ifdef HsColour
> instance Version Beep
> instance Version Foo
#endif

<p>Next we derive <code>Serialize</code> instances for our types:</p>

> $(deriveSerialize ''Beep)
> $(deriveSerialize ''Foo)

<p>Now we can use <code>serialize</code> to serialize values. Let's look at the output of <code>serialize Beep</code>:</p>

<pre>
*Main> Data.ByteString.Lazy.unpack $ serialize Beep
[0,0,0,0,0,0,0,0,0]
*Main>
</pre>

<p>We see that <code>Beep</code> serializes to 9 bytes. The first 8 bytes represent the <code>VersionId</code>. <code>VersionId</code> is basically an <code>Int</code>, and the serialization code always treats <code>Int</code>s as 64-bit values to avoid cross-platform issues. The final byte indicates which constructor of <code>Beep</code> was used. In this case the zeroth constructor was used.</p>
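
<p>To make the 9-byte layout concrete, here is a small standalone model of it that uses only <kbd>base</kbd>. It is a sketch of the layout described above, not the code <kbd>HAppS-Data</kbd> actually runs: the names <code>int64Bytes</code> and <code>encodeTagged</code> are hypothetical, and the big-endian byte order is an assumption inferred from the outputs shown in this tutorial.</p>

#ifdef HsColour
> module Main where
>
> import Data.Bits (shiftR)
> import Data.Word (Word8)
>
> -- Encode an Int as 8 bytes, most significant byte first.
> -- fromIntegral keeps only the low 8 bits of each shifted value.
> int64Bytes :: Int -> [Word8]
> int64Bytes n = map (\s -> fromIntegral (n `shiftR` s)) [56, 48 .. 0]
>
> -- Model of the layout: 8 bytes of version number, then 1 byte of constructor index.
> encodeTagged :: Int -> Word8 -> [Word8]
> encodeTagged version constructorIndex = int64Bytes version ++ [constructorIndex]
#endif

<pre>
*Main> encodeTagged 0 0
[0,0,0,0,0,0,0,0,0]
*Main>
</pre>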

<p>At first it may seem like we don't have enough information here to deserialize the data; after all, there are no type names, constructors, etc. But deserializing these bytes is no different from doing <code>read "1" :: Int</code>. Because we know at compile time the type of the value we want to read, we do not need to record that information in the stored data. We just do:</p>

<pre>
*Main> deserialize (serialize Beep) :: (Beep,ByteString)
(Beep,Empty)
*Main> 
</pre>
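
<p>The same idea can be seen in miniature without <kbd>HAppS-Data</kbd> at all. In the sketch below, <code>decodeTag</code> is a hypothetical helper (not a library function) that turns a stored constructor index into whatever enumeration type the caller asks for. The byte itself says nothing about the type; the type annotation does all the work.</p>

#ifdef HsColour
> module Main where
>
> import Data.Word (Word8)
>
> -- Interpret a stored constructor index as the enumeration type requested by the caller.
> decodeTag :: Enum a => Word8 -> a
> decodeTag = toEnum . fromIntegral
#endif

<pre>
*Main> decodeTag 0 :: Bool
False
*Main> decodeTag 0 :: Ordering
LT
*Main>
</pre>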

<p>As a side note, <code>String</code>s are serialized to a very compact representation. In fact, they are stored as compactly as <code>ByteString</code>s because they are first converted to a <code>ByteString</code>. </p>

<pre>
*Main> Data.ByteString.Lazy.unpack $ serialize "hello"
[0,0,0,0,0,0,0,5,104,101,108,108,111]
*Main> Data.ByteString.Lazy.unpack $ serialize (Data.ByteString.Lazy.Char8.pack "hello")
[0,0,0,0,0,0,0,5,104,101,108,108,111]
*Main> 
</pre>

<p>The first 8 bytes are the length of the <code>String</code>, and the remaining bytes are the UTF-8 encoded characters of the <code>String</code>.</p>

<p>So, if your application is best served by using <code>String</code>s instead of <code>ByteString</code>s, you do not have to take any extra steps to ensure that the serialized data is compactly represented.</p>
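
<p>The length-prefixed layout shown above can be modelled in a few lines of plain Haskell. This is a sketch under stated assumptions rather than the library's code: <code>encodeAscii</code>, <code>decodeAscii</code>, and <code>int64Bytes</code> are hypothetical names, the length is assumed to be stored big-endian, and only ASCII input is handled (for ASCII, one character is one UTF-8 byte, so real UTF-8 encoding is left out). Like <code>deserialize</code>, the decoder returns the value together with the unconsumed bytes.</p>

#ifdef HsColour
> module Main where
>
> import Data.Bits (shiftR)
> import Data.Char (chr, ord)
> import Data.Word (Word8)
>
> -- Encode an Int as 8 bytes, most significant byte first.
> int64Bytes :: Int -> [Word8]
> int64Bytes n = map (\s -> fromIntegral (n `shiftR` s)) [56, 48 .. 0]
>
> -- 8-byte length prefix followed by one byte per (ASCII) character.
> encodeAscii :: String -> [Word8]
> encodeAscii s = int64Bytes (length s) ++ map (fromIntegral . ord) s
>
> -- Read the length prefix, take that many bytes, and return the rest untouched.
> decodeAscii :: [Word8] -> (String, [Word8])
> decodeAscii bytes =
>   let (lenBytes, rest)  = splitAt 8 bytes
>       len               = foldl (\acc b -> acc * 256 + fromIntegral b) 0 lenBytes
>       (payload, remain) = splitAt len rest
>   in  (map (chr . fromIntegral) payload, remain)
#endif

<pre>
*Main> encodeAscii "hello"
[0,0,0,0,0,0,0,5,104,101,108,108,111]
*Main> decodeAscii (encodeAscii "hello")
("hello",[])
*Main>
</pre>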

<h3>Simple Migration</h3>

<p>Let's say we want to add another constructor to the <code>Beep</code> type. As a first pass, we will actually create a whole new type named <code>Beep'</code>, which is similar to the old type, but has an additional constructor <code>BeepBeep'</code>.</p>

> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
>  [d|
>      data Beep' = BeepBeep' | Beep'
>    |])
>
> $(deriveSerialize ''Beep')

<p>Because we are extending a previous type, our <code>Version</code> instance will look a bit different:</p>

> instance Version Beep' where
>     mode = extension 1 (Proxy :: Proxy Beep)

<p>This indicates that we are extending the old type <code>Beep</code>. The new version number must be higher than the old version, but does not have to be strictly sequential.</p>

<p>Because we specified that this type is a newer version of an older type, we also need to tell HAppS how to migrate the old data to the new type. To do this, we simply create an instance of the <code>Migrate</code> class.</p>

#ifdef HsColour
> class Migrate a b where
>    migrate :: a -> b
#endif

<p>The <code>Migrate</code> class is quite simple. It contains a single function, <code>migrate</code>, which migrates something of type <code>a</code> to type <code>b</code>. In our current example, all we need is:</p>

> instance Migrate Beep Beep' where
>     migrate Beep = Beep'

<p>We can demonstrate migration by serializing a value of type <code>Beep</code> and deserializing it as type <code>Beep'</code>. The migration happens automatically in the <code>deserialize</code> function.</p>

<pre>
*Main> fst $ deserialize (serialize Beep) :: Beep'
Beep'
*Main>
</pre>

<p>When <code>deserialize</code> tries to deserialize the data produced by <code>serialize Beep</code>, it will first check the version number. When it sees that the version number in the stored data is lower than the version number of the current type it will instead try to decode it as the type you specified as the previous version. If the version associated with the previous type is still higher than the value in the serialized data, the migration code will recurse until it finds a matching version number. Once it finds a matching version number, it will call the corresponding deserialization "instance" to decode the old data. Then as the recursion unwinds, it will apply the <code>migrate</code> function to migrate the data to newer and newer formats until it reaches the newest format.</p>
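
<p>The recursion described above can be sketched as a small standalone model. Everything in it is hypothetical and exists only for this illustration: the <code>Stored</code> type, the <code>Versioned</code> class, the <code>load</code> function, and the three <code>BeepV<i>n</i></code> types are not part of <kbd>HAppS-Data</kbd>, and the real library works on <code>ByteString</code>s rather than on a toy payload. The model also simplifies the check to "versions are equal or not" instead of comparing which one is lower.</p>

#ifdef HsColour
> {-# LANGUAGE ScopedTypeVariables #-}
> module Main where
>
> import Data.Proxy (Proxy (..))
>
> -- Toy stored value: a version number plus an undecoded payload.
> data Stored = Stored Int String
>
> class Versioned a where
>   versionOf   :: Proxy a -> Int
>   -- Decode a payload that was written at exactly this version.
>   decodeExact :: String -> a
>   -- Decode via the previous version and migrate; Nothing if there is none.
>   decodeOlder :: Stored -> Maybe a
>   decodeOlder _ = Nothing
>
> -- Decode directly on a version match, otherwise fall back to older versions.
> load :: forall a. Versioned a => Stored -> Maybe a
> load stored@(Stored v payload)
>   | v == versionOf (Proxy :: Proxy a) = Just (decodeExact payload)
>   | otherwise                         = decodeOlder stored
>
> data BeepV0 = BeepV0 deriving (Eq, Show)
> data BeepV1 = BeepBeepV1 | BeepV1 deriving (Eq, Show)
> newtype BeepV2 = BeepV2 Int deriving (Eq, Show)
>
> migrateV0 :: BeepV0 -> BeepV1
> migrateV0 BeepV0 = BeepV1
>
> migrateV1 :: BeepV1 -> BeepV2
> migrateV1 BeepBeepV1 = BeepV2 2
> migrateV1 BeepV1     = BeepV2 1
>
> instance Versioned BeepV0 where
>   versionOf _   = 0
>   decodeExact _ = BeepV0
>
> instance Versioned BeepV1 where
>   versionOf _        = 1
>   decodeExact "0"    = BeepBeepV1
>   decodeExact _      = BeepV1
>   decodeOlder stored = fmap migrateV0 (load stored)
>
> instance Versioned BeepV2 where
>   versionOf _         = 2
>   -- read is partial; acceptable for a sketch with well-formed payloads.
>   decodeExact payload = BeepV2 (read payload)
>   decodeOlder stored  = fmap migrateV1 (load stored)
#endif

<pre>
*Main> load (Stored 0 "") :: Maybe BeepV2
Just (BeepV2 1)
*Main> load (Stored 1 "0") :: Maybe BeepV2
Just (BeepV2 2)
*Main> load (Stored 7 "") :: Maybe BeepV2
Nothing
*Main>
</pre>

<p>In the first call the stored version is 0, so the model recurses from <code>BeepV2</code> to <code>BeepV1</code> to <code>BeepV0</code>, decodes there, and applies <code>migrateV0</code> and then <code>migrateV1</code> on the way back out. The last call shows what happens when no version in the chain matches.</p>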

<h3>Managing History</h3>

<p>A big issue in the above example is that when we added the new constructor we also changed the name of the type and its existing constructors. That is not very convenient in a real application where you have a multitude of references to the old names.</p>

<p>Fortunately, we do not have to change the name of the type to add a new constructor. As we saw in the beginning, the name of the type and the names of the constructors are not actually stored in the serialized data. So, instead we can change the name of the old type from <code>Beep</code> to <code>OldBeep</code> and update its constructor as well.</p>

> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
>  [d|
>      data OldBeep = OldBeep
>   |])
>
> $(deriveSerialize ''OldBeep)
> instance Version OldBeep

<p>Because <code>OldBeep</code> and <code>Beep</code> have the same shape, they will serialize to the same byte sequence:</p>

<pre>
*Main> Data.ByteString.Lazy.unpack $ serialize OldBeep
[0,0,0,0,0,0,0,0,0]
*Main> Data.ByteString.Lazy.unpack $ serialize Beep
[0,0,0,0,0,0,0,0,0]
*Main> 
</pre>

<p>That means we can serialize an <code>OldBeep</code> value and then deserialize it as a <code>Beep</code> value, like this:</p>

<pre>
*Main> fst $ deserialize (serialize OldBeep) :: Beep
Beep
*Main> 
</pre>

<p>Note that this is not the same as migration. Here we are just exploiting the fact that because the type name and constructor names are not encoded in the serialized data we can change those names and still be able to deserialize the data.</p>

<h4>Full Migration Example #1</h4>

<p>Here is the full example which shows:</p>
 <ol>
  <li><code>Beep</code> renamed to <code>OldBeep</code>
  <li>the new <code>Beep</code> with the extra constructor
  <li>the migration code from <code>OldBeep</code> to <code>Beep</code>
 </ol>


#ifdef HsColour
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
>  [d|
>      data OldBeep 
>          = OldBeep
>   |])
>
> instance Version OldBeep
> $(deriveSerialize ''OldBeep)
>
>
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
>  [d|
>      data Beep = BeepBeep | Beep
>    |])
>
> instance Version Beep where
>     mode = extension 1 (Proxy :: Proxy OldBeep)
>
> $(deriveSerialize ''Beep)
>
> -- Old values migrate to the constructor with the same meaning.
> instance Migrate OldBeep Beep where
>     migrate OldBeep = Beep
#endif

<h4>Using separate files to manage type history</h4>

<p>Keeping all the revisions of your type in one file, and changing the name of the type and its constructors every revision is tedious and hard to manage. Instead, we can use a system where we rename the files that contain our types. To start, we will put the types we want to serialize in a separate file (or files), such as <kbd>Types.lhs</kbd>.</p>

#ifdef HsColour
#include "step1/Types.lhs"
#endif

<p>Now let's say we want to add a constructor <code>Bop</code> to the type <code>Foo</code>. First we rename <kbd>Types.lhs</kbd> to <kbd>Types_000.lhs</kbd> and change the module name to reflect the changed file name:</p>

#ifdef HsColour
#include "step2/Types_000.lhs"
#endif

<p>Next we create a new <kbd>Types.lhs</kbd>:</p>

#ifdef HsColour
#include "step2/Types.lhs"
#endif


<h3>Serializing Datatypes from 3rd Party Libraries</h3>

<p>It is also possible to serialize datatypes from 3rd party libraries, provided those types have <code>Data</code> and <code>Typeable</code> instances. There is a caveat, however: if the 3rd party library changes the type, you will no longer be able to read your data. This is not a fatal flaw, though. You can simply copy the old type definition into a local module, and then migrate the old data to the new format.</p>

<h3>Suggested Policy</h3>

<ol>
 <li>Put the types you will serialize in one or more files which only contain types
 <li>Deploy your web 2.718 killer app
 <li>Before you do any more development, copy the current type files to sequential versions and create new current type files which re-export all the types. You can skip this step if the current type file only contains re-exports, i.e., if no type changes were made to that type file during the previous iteration.
 <li>Make changes for current development cycle, and then go to step 2.
</ol>

 </body>
</html>