Value creation across data pipelines

Understanding how data can be processed and monetised through each phase of the data pipeline

Andrea Armanni
Ocean Protocol

--

Data scientists are always on the lookout for new and innovative ways to use data. Data is used to understand the world and make predictions about the future, and, when applied to AI, it becomes the heart of training ML algorithms. These algorithms aim to learn an accurate model of the world in particular scenarios such as robotics, medical diagnostics, finance, and other applications with the potential to bring about positive change.

To be accurate and provide compelling business and operational outcomes, ML models require not only a significant amount of data, but also high-quality data, provided through highly efficient and scalable pipelines. Just like a healthy heart needs oxygen and reliable blood flow, AI/ML engines need a steady stream of cleansed, accurate, and trusted data.

Currently, data is a big barrier to building AI systems: high-quality, curated data is rarely available. This has led to data being siloed and value extraction being concentrated in the hands of a few dominant players, hindering innovation at scale.

The need of the hour is to promote data sovereignty and steer the data economy from its shadowy, opaque past towards decentralized data exchange and transparency.

Understanding how data can be processed is the first step towards achieving this.

In this article, we’ll go through the different stages of data preparation and look at how much value there is in each step of the data pipeline.

A data pipeline refers to the set of data processing activities that handle the sourcing, processing, transformation, and loading of data.

Phase 1: Data Extraction

Data extraction is an essential step in data processing pipelines. It involves pulling data from raw sources and preparing it for further analysis. Sources may include databases, files, web pages, or blockchains. Making this data available kick-starts the journey towards machine learning.

Hint: for extracting on-chain data, visit Messari Subgraphs.
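As a hedged illustration of this phase, the sketch below pulls raw records from a subgraph with a plain GraphQL query over HTTP. The endpoint URL, entity, and field names are placeholders; they depend on the specific subgraph schema you query.

```python
import requests

# Placeholder subgraph endpoint -- replace with the actual subgraph URL you use
SUBGRAPH_URL = "https://api.thegraph.com/subgraphs/name/<org>/<subgraph>"

# Hypothetical GraphQL query; entity and field names depend on the subgraph schema
QUERY = """
{
  tokens(first: 5) {
    id
    name
    lastPriceUSD
  }
}
"""

def extract_raw_data():
    """Fetch raw on-chain records and return them as a list of dicts."""
    response = requests.post(SUBGRAPH_URL, json={"query": QUERY}, timeout=30)
    response.raise_for_status()
    return response.json()["data"]["tokens"]

if __name__ == "__main__":
    rows = extract_raw_data()
    print(f"Extracted {len(rows)} raw records")
```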

Phase 2: Data Processing

This phase involves cleaning and formatting data for further analysis or use. In the case of data applied to trading, a data scientist may take candle data and derive a median value from it. They may do some cleaning and normalization, and transform the raw data (Open, High, Low, Close, Volume) into processed data (the median value). There are countless ways to process data, each with its own use case and value.
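A minimal sketch of such a processing step, assuming a pandas DataFrame of raw candles with illustrative open/high/low/close/volume columns:

```python
import pandas as pd

def process_candles(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean raw OHLCV candles and derive a per-candle median price.

    Assumes columns: open, high, low, close, volume (names are illustrative).
    """
    df = raw.copy()

    # Cleaning: drop candles with missing fields or non-positive prices
    df = df.dropna(subset=["open", "high", "low", "close", "volume"])
    df = df[(df[["open", "high", "low", "close"]] > 0).all(axis=1)]

    # Processing: reduce the raw OHLC fields to a single median value per candle
    df["median_price"] = df[["open", "high", "low", "close"]].median(axis=1)

    # Normalization: scale volume into the [0, 1] range
    df["volume_norm"] = (df["volume"] - df["volume"].min()) / (
        df["volume"].max() - df["volume"].min()
    )
    return df[["median_price", "volume_norm"]]
```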

Phase 3: Data Transformation

In order to get the most out of the data, it is essential to transform it properly. Once the data has been processed, data scientists can extract feature vectors. For instance, it may be of interest to determine whether the past values of a stock are informative of its future value. For this, basic transformations such as the Moving Average (MA) are used extensively. Many other types of transformations are used by data scientists to extract valuable information from data, including non-linear transformations and high-dimensional expansions, among others.
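For example, a simple moving-average transformation over the processed median price might look like the sketch below. The median_price column follows the processing sketch above, and the window sizes are arbitrary choices for illustration.

```python
import pandas as pd

def add_moving_average_features(df: pd.DataFrame, windows=(5, 20, 50)) -> pd.DataFrame:
    """Add simple moving averages of the median price over several windows."""
    out = df.copy()
    for w in windows:
        out[f"ma_{w}"] = out["median_price"].rolling(window=w).mean()
    # Rolling windows leave NaNs at the start of the series; drop those rows
    return out.dropna()
```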

Given the wide variety of transformations available to extract features from data, it is common practice to generate as many features as possible and then rank them according to their predictive power. When enough data is available, feature extraction and selection can be included directly in the ML model during training.
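One possible way to do that ranking, sketched below, is to score each candidate feature with mutual information against the target and sort the scores; any other scoring metric could be substituted. The feature and target names are assumed to come from the earlier sketches.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def rank_features(features: pd.DataFrame, target: pd.Series) -> pd.Series:
    """Rank candidate features by their estimated predictive power."""
    scores = mutual_info_regression(features.values, target.values)
    return pd.Series(scores, index=features.columns).sort_values(ascending=False)
```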

Phase 4: Model Training & Testing

This is when the model starts to take shape. Features extracted during the data transformation phase are preselected based on a performance metric that indicates how well each feature helps to make correct predictions. During model training, model parameters are learned such that the prediction error is minimized. The quality of the resulting model depends not only on the model selected but also on the quality and quantity of the data and features provided. This is analogous to a child learning to speak: the more they listen to adults speaking correctly, the better their brain's language regions are tuned to recreate speech accurately.
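A minimal training-and-testing sketch, assuming the feature matrix and a target (e.g., the next candle's median price) produced by the previous steps, and using a random forest purely as an example model:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def train_and_test(features, target):
    """Fit a model on the selected features and report its held-out error."""
    # Keep time order when splitting market data, hence shuffle=False
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, shuffle=False
    )
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    error = mean_absolute_error(y_test, model.predict(X_test))
    return model, error
```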

Phase 5: Model Deployment

Finally, once the model is trained, we use it to make predictions on new data where the target variable is unknown. As more data goes through the pipeline, it gets ingested and is used to return ever more accurate predictions.
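A bare-bones deployment sketch, assuming the model trained above and a new_features frame produced by the same pipeline, might persist the model and serve predictions like this:

```python
import joblib

# Persist the trained model so it can be served outside the training environment
joblib.dump(model, "eth_price_model.joblib")  # `model` from the training step above

# At inference time: load the model and predict on fresh features
# for which the target value is not yet known
deployed_model = joblib.load("eth_price_model.joblib")
prediction = deployed_model.predict(new_features)  # `new_features` from the same pipeline
print(f"Predicted next value: {prediction[-1]:.2f}")
```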

As explained above, each dataset requires work before it can move to the next phase of the data pipeline. Because time and knowledge are invested in making these transformations, value can be extracted from every single dataset.

Moreover, if datasets stay fresh, that is, they are constantly cleaned, improved, and extended, then multiple insights and trading strategies can be developed over time, which makes those datasets extremely valuable.

In an open-source environment, people can publish data assets at any level of transformation. For example, one can make just raw data available for users to consume. Alternatively, to save time and skip the initial phases of the data cleaning pipeline, users might decide to buy someone’s feature vector data and pay a premium for it.

Bottom line: throughout the data pipeline there are plenty of points at which data computation can happen, and in each phase value is created by publishing the resulting data asset, whether that’s raw data or a highly accurate predictive model.

With data challenges as the first internal user of Ocean.py, we aim to expand the available features together with the community and explore different ways to extract value from every step of the data pipeline.

The ETH prediction data challenge series presents a great opportunity to leverage Ocean tech and the resources that the community has created around it.

The Ocean Market provides a wide variety of datasets and the opportunity to monetize data and prediction algorithms while maintaining sovereignty over them in a fully decentralized environment.

About Ocean Protocol

Ocean Protocol is a decentralized data exchange platform spearheading the movement to democratise AI, break down data silos, and open access to quality data. Ocean’s intuitive marketplace technology allows data to be published, discovered, and consumed in a secure, privacy-preserving manner by giving power back to data owners. Ocean resolves the tradeoff between using private data and the risks of exposing it.

Follow Ocean Protocol on Twitter, Telegram, or GitHub. And chat directly with the Ocean community on Discord.
