Data Synchronisation

Norman Graves
November 14, 2013

It is not uncommon for a website to require data drawn from external data sources such as CRM systems, ERP systems or legacy custom databases. Frequently these systems reside in a different environment from the CMS itself, necessitating communication over some sort of data connection. This presents a risk to the operation of the site, because it opens up the possibility that the link might go down or that the external system might be unavailable. Equally it is not uncommon for information entered into a web page by a user to be written to an external system such as a CRM or ERP system.

The challenge is to design a robust mechanism for transferring data between the CMS system and these external systems which can deal with system degradation and at the same time present an acceptable user experience.

Almost all CMS systems feature some sort of structured data construct. Such structured items are in essence XML documents. A typical structured content item might contain plain text elements, rich text elements, enumerated types, etc. The structured content item presents an ideal place in which to hold data synchronised from an external data source, since it can be designed to match the structure of the external data record. In this way the CMS has access to a local copy of the data, so even if the external system is unavailable the CMS can present the latest data to the user.

Web services provide the ideal mechanism to transfer the data. In a sense the web service acts as a transport layer. Web services offer security and reliability for handling this low level data transfer. The question is how to use them in a manner which is efficient and fault tolerant.

The simplest mechanism would be to read the data from the external system “on demand”, in other words to read the data as it is required by the current web page. There are two problems with this approach: the transfer time from the remote data source must be added to the page load time, and if the link is down the page will not have any data to display.

We could modify the algorithm slightly to overcome the latter of these two problems: still read the data on demand, but save it to a structured data item; then if the link is down and the data cannot be read directly from the external system, the system can use the latest version taken from the structured data item. However this is a compromise with a number of shortcomings of its own. In the worst case the last successful read might have happened several hours or even days ago, and the data might have changed since then. The scheme also adds delay to the page load time, since it is now necessary to wait for the link to time out before falling back to the locally saved copy.
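The fallback scheme can be sketched as follows. This is a minimal illustration, not a real CMS API: `fetch_remote` and the `CACHE` dictionary are hypothetical stand-ins for the web service call and the structured data item store.

```python
# Sketch of "read on demand, fall back to the saved copy".
# fetch_remote and CACHE are hypothetical stand-ins, not a real CMS API.

CACHE = {}  # last known copy of each record, keyed by record id

def fetch_remote(record_id):
    """Stand-in for the web service call; raises when the link is down."""
    raise ConnectionError("link down")

def read_record(record_id, fetch=fetch_remote):
    """Try the external system first; fall back to the cached copy."""
    try:
        value = fetch(record_id)
        CACHE[record_id] = value      # refresh the local copy on success
        return value
    except ConnectionError:
        # Link down: serve the last saved version, which may be stale --
        # and note we still paid the cost of waiting for the failure.
        return CACHE.get(record_id)
```

Note that the stale-data and timeout shortcomings described above are both visible here: the `except` branch returns whatever was last cached, and is only reached after the remote call has failed.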

We can improve on the on demand mechanism by setting up a background process within the CMS which continuously reads the data from the external system and saves it to structured data items in the CMS. In this way the CMS does not need to do anything special; it simply presents the data that it sees in its structured data items as if it were conventional CMS data. The background process ensures that these structured data items always contain the latest version of the data.

To be efficient, the background process should only transfer data which has changed since it was last read. There is little point in reading all the data over and over, transferring the same set of values each time. Such a brute force approach is wasteful of bandwidth and could add considerable latency to the rate at which data items are updated. A much more efficient approach, indeed the optimum approach, is to detect which data items have changed and send only those changed values across the link.

The system then works as follows:

The background process running in the CMS environment sends a web service request complete with a time/date stamp. The external system then identifies all of the records that have changed since the time/date stamp and marshals them into a web service response. The background process receives this list of items and updates the structured data items accordingly. 

Once the transfer is complete the system repeats the request. The time/date stamp derives from the last successful data transfer, so each request in effect asks for a list of items that have changed since the last request.
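One pass of that loop might look like the sketch below. The `fetch_changes` and `apply_change` callables are assumptions, standing in for the web service request and the structured-data-item update respectively.

```python
from datetime import datetime, timezone

def poll_changes(last_sync, fetch_changes, apply_change):
    """One pass of the background loop: ask the external system for every
    record changed since last_sync, apply each change to the local
    structured data items, and return the new sync point."""
    # Take the timestamp *before* fetching: anything that changes during
    # the transfer will simply be picked up again on the next pass.
    now = datetime.now(timezone.utc)
    for record in fetch_changes(since=last_sync):
        apply_change(record)
    return now  # becomes last_sync for the next request
```

Taking the timestamp before the fetch is deliberately conservative: a change that lands mid-transfer is fetched twice rather than missed, which is safe as long as applying a change is idempotent.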

Where the two systems operate in different time zones, or there is no easy way to guarantee that the clocks on the two systems remain in sync, the time/date stamp can be replaced by a more abstract sequence number. The background process sends out the request with a sequence number, and the external system responds to that request quoting the same number. The sequence number is then incremented and the process repeated. Each system associates a sequence number with the time and date from its own internal clock and uses this to detect changes. The external system sends over a list of items that have changed since the last set of data was sent.
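The sequence-number handshake can be sketched like this; the dictionary message shapes are illustrative assumptions only, standing in for the real web service request and response.

```python
def next_request(seq):
    """The request the background process sends: it just quotes the number."""
    return {"seq": seq}

def handle_response(seq, response):
    """Advance the sequence number only when the response quotes the
    number we asked with; otherwise treat the reply as stale and ignore it."""
    if response.get("seq") != seq:
        return seq, []                 # mismatched reply: ask again
    return seq + 1, response.get("changes", [])
```

Because neither side's wall clock appears in the exchange, clock drift between the two environments cannot cause changes to be skipped or double-counted.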

The likelihood of any data item changing between requests is very small; even on a busy system such changes are relatively infrequent, so most requests will receive a response indicating that there are no changes to report. Every now and then there will be a change, and this will be reported. In exceptional circumstances there may be more than one change. Since the background process runs in a tight loop, every change is reported almost as soon as it happens, so the latency is extremely small, perhaps a few milliseconds. Indeed it is hard to conceive of a scheme with lower latency, since sending only the changed data represents the minimum amount that it is necessary to send.

There are a couple of exceptions that we need to handle. What happens if a data item is deleted in the external system? In this case the response packet needs to flag the fact that the item has changed, but indicate that the nature of the change is a deletion. Generally the Ektron CMS does not delete items; it merely marks them as deleted and leaves them in the database (this allows for data recovery and rollback). So this is what would happen here: the corresponding structured data item is simply marked as deleted.

There is of course the opposite case, where a new item has been created in the external system. Here the external system will send over the change, but the item does not exist in the CMS. When processing the response packet, the CMS checks whether the item being sent already exists: if it does, the item is modified; if not, a new item is created.
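Taken together, these two exception cases make the response processing an upsert with a soft delete. A sketch, using a plain dictionary as a stand-in for the CMS repository and an assumed `deleted` flag in the change record:

```python
def apply_change(items, record):
    """Apply one change from the external system to the local item store.

    - deleted records are soft-deleted (marked, not removed), mirroring
      the Ektron behaviour described above;
    - unknown ids create a new item; known ids modify the existing one.
    items is a plain dict standing in for the CMS repository."""
    key = record["external_id"]
    if record.get("deleted"):
        if key in items:
            items[key]["deleted"] = True     # mark, don't remove
    elif key in items:
        items[key].update(record)            # modify existing item
    else:
        items[key] = dict(record)            # create new item
    return items
```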

So far we have dealt with the case where we wish to read data from an external system in order to present the results on a web page. By introducing a background process and making it responsible for the data synchronisation the CMS can operate by simply treating the data as being there and up to date in its own data repository. 

The background task takes care of data creation and data deletion, and ensures that the data in the CMS is up to date. It also takes care of other errors such as loss of the data link or a failure in the external system. If the link is down or the external system is not available, the request for new changes goes unanswered. After some reasonable timeout period the background process retries using the same sequence number. It goes on doing so until the external system responds, at which point the data transfer resumes and the two systems come back into sync with one another.
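The retry behaviour can be sketched as follows. Here `send_request` is an assumed stand-in for the web service call, raising `TimeoutError` while the link or the external system is down; a real loop would sleep between attempts and retry indefinitely rather than cap the attempts.

```python
def sync_with_retry(seq, send_request, max_attempts=5):
    """Resend the same sequence number until the external system answers,
    then advance. send_request is assumed to raise TimeoutError while the
    link or the external system is down. A real loop would also sleep
    between attempts; that is omitted here for brevity."""
    for _ in range(max_attempts):
        try:
            changes = send_request(seq)
        except TimeoutError:
            continue                  # retry with the *same* sequence number
        return seq + 1, changes
    raise TimeoutError("external system unavailable")
```

Reusing the same sequence number on every retry is what lets the two sides fall back into step: the external system sees the first request it actually receives, and nothing is lost in between.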

In some cases it is necessary to deal with bi-directional data transfer. This might be the case, for example, where a CMS user creates an instance of a structured data item in the CMS which needs to be stored in the external system.

In this case we can set up a more or less identical system which works in the opposite direction. Since we already have a background process running in the CMS environment, it makes sense to use this to control events. We know which records in the CMS have changed, since a change can only happen as a result of an On Publish Event or of a change initiated by the external system.

Things can be simplified if we extend the cross-referencing system so that instead of just holding a reference to the external id in the structured data item, we also store the structured data item id in the external system.

In this case what happens is as follows:

The background process builds a list of changed structured data items based on the On Publish Event. It also adds to this list any items that were created (new items) as a result of requests from the external system. It sends over this list with a sequence number. The external system processes these changes and sends back a response indicating that it has completed the task. The CMS then sends over the next set of changes with a new sequence number. In the event that the external system is not available or the link is down, the system times out and resends the same request after a short interval.
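The first step above, building the outbound list, can be sketched like this. The two id lists are assumed inputs, gathered from On Publish Events and from items created by earlier inbound requests.

```python
def build_outbound_batch(published_ids, externally_created_ids, seq):
    """Next request to the external system: items changed by On Publish
    Events plus items the external system itself created, so that their
    newly allocated CMS ids are written back."""
    # dict.fromkeys de-duplicates while preserving order
    changed = list(dict.fromkeys(list(published_ids) +
                                 list(externally_created_ids)))
    return {"seq": seq, "changed_items": changed}
```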

The reason for sending back items which were created in the CMS as a result of a change sent from the external system is to ensure that we write the structured data item id back to the external system. The external system sent over a new item that had an external id, but no structured data item id at that time. The CMS adds the item to its database, which causes it to be allocated a structured data item id. This item is then flagged as having changed (since it has now acquired a CMS id) and so is re-synched back to the external system.
