Difficult Data Made Easy with Gravicus Data Lines

The Gravicus Osprey Platform has been in development for several years now. It is a cutting-edge solution designed from the ground up specifically for Information Governance, driven by some of the foremost experts in the field.

In a series of blog posts I hope to provide an introduction to our company and technology whilst sharing some insights into the journey that brought us here.

Let’s start at the beginning. Our products are all about “Information Governance”, which covers a huge array of fields: from Data Migrations through Records Management to Legal, Privacy, and much more. The common element is helping organisations handle Information. But before we can sort out Information, we need to start with raw Data. Enterprise Data comes in many shapes, sizes, formats, volumes and sources. The vast majority of it is unstructured (by which I mean human-written), which doesn’t make it any easier for computers to extract useful Information. But if it were easy, it wouldn’t be worth doing!

In this world of Unstructured Data Analytics, a lot of fanfare goes to visualisation and dashboards; they look great and everyone likes to show off beautiful graphs (we certainly do). It’s easy to assume that the data behind those dashboards arrives neatly formatted, handled through various standards and imports.

In practice, however, data comes in many, many formats, varying right down to the terminology used, and the bulk of the work (and cost) on data-driven projects tends to go into moving, cleaning, indexing, sorting and filtering raw data. I’ve seen it take years. In fact I’ve seen projects that never end because the data modelling can’t keep up with the company’s workforce.

I once went to a conference on Knowledge Management where an IT Director at an AmLaw 100 firm proudly told a room full of people about her successful migration of 50,000 KM Documents from one version of SharePoint to another. It had taken her team “only” three years. The reason it took so long? Alongside the migration they decided to run a metadata mapping project, to clean up and enrich content while migrating. Their process consisted of tweaking a script, running a full migration, reviewing the output, and starting again.

This is a pretty common scenario. It makes perfect sense to use a Data Migration project as an Information Cleanup opportunity. Nobody wants to move junk across to a brand-new environment; they want to start from a nice clean slate.

It’s when you start looking at Information, rather than the Data that it’s in, that things get tricky. Moulding data until you get useful Information out of it can be very difficult, time-consuming and expensive. In fact, it can often feel more like bashing than moulding.

Very early on we needed to come up with an effective way to get lots of data into our Osprey Platform. At the time it was tempting to create a regular Search Crawler (sometimes called a spider); these just read information from documents and push what they find into a Search Index.

On the right is the not-so-detailed technical spec we originally worked to. I dug this up from an early slide deck. Sometimes you just have to start somewhere!

However, we rapidly realised we also had to address other Information Governance use cases, and revised that plan; not all data would be going to a Gravicus Index. Sometimes it might go to Alfresco, and sometimes it wouldn’t move at all but simply be encrypted in place. It all depends on what’s needed at the time.

It just goes to show that even the most basic early plans can turn out to be incorrect. We chucked out the first attempt and started again.

We needed a system which could act on the information in the data, not just shift the data from place to place.

So we took a step back and, using our domain expertise, designed something from scratch. We were working to create a Next Generation Information Governance system. That meant we needed a mechanism which could generically handle a lot of data use cases without compromising on usability, flexibility, scalability or defensibility.

We came up with Data Lines. They represent a fresh look at migrating, securing, archiving, crawling or collecting data, to and from any source in an organisation.

Data Lines are like train lines. Content rapidly moves along them and at the stops it can be enriched, organised, or filtered on the way to its destination. Like dropping by a museum on the way home to pick up some new information. Now imagine you could decide for yourself which stops you want the line to go past on any given day. That’s the power of Data Lines.

We came up with the Lego-brick approach. Well, I guess Lego came up with that, but we decided to adopt it. As with many of the most powerful solutions, by chaining together blocks of relatively simple logic we can achieve tremendous results.

A sample sequence of processors in a DataLine which migrates data from SharePoint to Alfresco, filters out unwanted types, and enriches metadata with Gravicus’ Artificial Intelligence

In the example above, we’re chaining together a SharePoint connector, a basic metadata filter (to remove junk documents), an AI annotator (to enrich metadata) and finally an Alfresco Sender. This is a relatively simple, but very powerful, example of how a DataLine can be set up. Without the need for any coding, this will migrate content, clean it and enrich it. Not bad going.
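Data Lines are configured in the product rather than written as code, but to make the chaining idea concrete, here is a deliberately simplified, self-contained sketch of a connector, filter, annotator and sender composed in plain Java. Every name in it is invented for illustration (the connector and sender are stubbed out with a list and a print statement); it is not the Osprey API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

// A toy illustration of the "chain of simple blocks" idea behind a Data Line.
// None of this is the real Osprey API; the names are invented purely to show
// how a connector -> filter -> annotator -> sender chain composes.
public class DataLineSketch {

    // A document flowing down the line: an id, a type, and a bag of metadata.
    record Doc(String id, String type, Map<String, String> metadata) {}

    public static void main(String[] args) {
        // "Connector": in reality this would read from SharePoint; here it is stubbed.
        List<Doc> source = List.of(
            new Doc("doc-1", "contract", new HashMap<>()),
            new Doc("doc-2", "tmp",      new HashMap<>()),
            new Doc("doc-3", "policy",   new HashMap<>()));

        Stream<Doc> line = source.stream()
            // "Filter": drop unwanted types (the junk documents).
            .filter(doc -> !doc.type().equals("tmp"))
            // "Annotator": enrich metadata; a real one would call an AI service.
            .peek(doc -> doc.metadata().put("classification", classify(doc)));

        // "Sender": in reality this would push to Alfresco; here it just prints.
        line.forEach(doc -> System.out.println("Sending " + doc.id() + " " + doc.metadata()));
    }

    // Stand-in for AI-driven enrichment.
    static String classify(Doc doc) {
        return doc.type().equals("contract") ? "legal" : "general";
    }
}
```

The point is the shape: each block does one simple job, and the value comes from how the blocks compose.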

As I mentioned above, data-driven projects can be real time-sinks, and that’s where attention to detail really comes in. That’s why Data Lines are designed from the ground up to minimise effort on all levels.

For example, Data Lines can inherit configuration, so if you have lots of them (you might have hundreds) they can all share a common parent. That way if, for instance, your security credentials change, you only have to make that change once.
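As a purely illustrative sketch of how that kind of fallback can work in principle, here is a tiny Java example in which a child line only stores what differs and defers everything else to a shared parent. The LineConfig record and the setting names are invented for this post; they are not our actual configuration format.

```java
import java.util.Map;

// A minimal sketch of configuration inheritance: a child Data Line falls back
// to its parent's settings unless it overrides them. The record and key names
// here are hypothetical, invented purely for illustration.
public class ConfigInheritanceSketch {

    record LineConfig(LineConfig parent, Map<String, String> own) {
        // Look up a setting locally first, then walk up the parent chain.
        String get(String key) {
            if (own.containsKey(key)) return own.get(key);
            return parent != null ? parent.get(key) : null;
        }
    }

    public static void main(String[] args) {
        // One shared parent holds the credentials used by hundreds of lines.
        LineConfig parent = new LineConfig(null,
            Map.of("sharepoint.user", "svc-migration", "sharepoint.password", "***"));

        // Each child only specifies what differs (here, its source site).
        LineConfig hrLine    = new LineConfig(parent, Map.of("source.site", "/sites/hr"));
        LineConfig legalLine = new LineConfig(parent, Map.of("source.site", "/sites/legal"));

        // Rotating the credentials means changing the parent once.
        System.out.println(hrLine.get("sharepoint.user"));    // svc-migration
        System.out.println(legalLine.get("sharepoint.user")); // svc-migration
    }
}
```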

These kinds of details help prevent human error and help keep projects on track.

We have lots of “Processor” types (Connectors, Senders, Annotators, and so on) to put in the chain. Together they cover many use cases, and we keep adding to the collection continually, based on client requests.

However, we can’t cover everything; we know that. So if you have some weird, old, bespoke, entirely-unknown-to-anyone-but-Gandalf-from-IT datasource (there’s always one…), the API is completely open: we invite clients and third parties to create their own connectors, enrichers, filters, senders, and anything else they want to add.

No coding needed, but certainly permitted.

It’s really easy to write, plug in and run new processors. Let the framework handle the hard parts!
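As a rough illustration of the idea, here is what a custom processor could look like in principle: a small interface, and a bespoke filter implementing it. The Processor interface and Document type below are simplified stand-ins invented for this post, not our published API.

```java
import java.util.Map;
import java.util.Optional;

// A hypothetical sketch of a plug-in processor: the interface and types here
// are illustrative stand-ins, not the actual Gravicus plug-in contract.
public class CustomProcessorSketch {

    record Document(String id, Map<String, String> metadata) {}

    // A processor takes a document and returns it (possibly enriched),
    // or nothing at all if the document should be filtered out.
    interface Processor {
        Optional<Document> process(Document doc);
    }

    // Example: a bespoke filter for that one ancient datasource, dropping
    // anything without an owner recorded in its metadata.
    static class RequireOwnerFilter implements Processor {
        @Override
        public Optional<Document> process(Document doc) {
            return doc.metadata().containsKey("owner") ? Optional.of(doc) : Optional.empty();
        }
    }

    public static void main(String[] args) {
        Processor filter = new RequireOwnerFilter();
        Document keep = new Document("a", Map.of("owner", "legal-team"));
        Document drop = new Document("b", Map.of());
        System.out.println(filter.process(keep).isPresent()); // true
        System.out.println(filter.process(drop).isPresent()); // false
    }
}
```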

Sometimes in the real world it’s important to look carefully at the things that aren’t flashy, but can make a difference in a practical sense. That’s what we’ve done with Data Lines.

They’re only one part of the Osprey Platform (we’ll cover the other parts in future posts), but clients have really taken to working with Data Lines. Like all the best things in life, it’s the simplicity and flexibility that really make them so appealing. We work hard to keep the complexity “under the hood” and that seems to be well appreciated.

Using Data Lines makes it easy to perform migrations, index information, archive unstructured data, enrich content or secure information on the spot. And if you combine them with other parts of the Osprey Platform it gets even better!

If you’d like to learn more about our Data Lines, do get in touch. We’re always happy to share.