Are Hadoop and Spark Silver Bullets for Financial Market Data ETL?

Vasyl Soloshchuk

capital markets data

Robo-advisors’ activity strongly depends on financial market data, which they need to process and analyze in order to suggest the best portfolio management approach for their clients. This data includes end-of-day prices, bid and ask quotes, split ratios, account data, etc. The efficiency of a particular robo-advisor depends on the number of securities covered (scope) and the historical depth (number of years) of data processed. The volume of data that needs to be stored thereby grows significantly.

Financial market data may be received from multiple sources, such as data aggregators and vendors. A number of companies provide market data and account data aggregation, and each of them support particular data formats. The following table contains a list of nine popular financial data providers, and information about the data types and formats that they provide.

Data Provider Data Formats Supported Data Types Provided
Bloomberg

https://www.bloomberg.com

API Real-time, End-of-day, Historical data
Commodity Systems Inc.

http://www.csidata.com/

ASCII, CSI, CSV, CSIM,  MetaStock, Excel

API

End-of-day, Historical data
CQG Inc.

http://www.cqg.com

ASCII, Excel Real-time, Historical data
Marketsource Online

http://www.msodata.com/

ASCII End-of-day, Historical data
Quovo

https://www.quovo.com

JSON; SFTP batch files

RESTful API; webhooks for real-time visibility into end user activity

Real-time, Historical data
Xignite

http://www.xignite.com

XML, JSON, CSV

REST, SOAP API

Real-time, Historical data
Finicity

https://www.finicity.com

XML, JSON

API

Real-time, Historical data
Money.net

https://money.net

Excel

API

End-of-day, Real-time data
EODData

http://eoddata.com

ASCII, EzyCharts, MetaStock, Quicken, CSV End-of-day, Historical data

The most common formats are ASCII, CSV, MetaStock, and Excel. However, data may be delivered in other formats as well. Most data vendors provide APIs that give access to the market data. SOAP and REST protocols are generally used.

Thus, robo-advisors receive vast amounts of data of various formats. All this data needs to be collected, stored, transformed, governed, integrated, analyzed, and visualized.

To deal with it all, the Extract, Transform, Load (ETL) process is used.

What is ETL and why does a WealthTech project need it?

ETL is the process of loading data from the source systems to the data warehouse. The steps are as follows:

  • Extract
    The data is extracted from the data vendors and made accessible for processing in the current wealth-management project (e.g., a robo-advisor). An incremental extract (only modified and new records are extracted) or full extract may be performed. Because the volumes of wealth-management data can be immense, the incremental extract is prefered. The full extract may affect performance, response time, and other parameters of both the data vendor and the robo-advisor.
  • Transform
    To ensure the quality of the data in the data warehouse, the extracted data requires cleansing according to the following unification rules:

    1. Make identifiers unique;
    2. Convert data to a standardized format;
    3. Validate fields against each other.

    Data transformation also includes joining and deduplicating data from several market data providers; sorting; deriving new calculated values; etc.

  • Load
    The load step ensures that the data is loaded into the database, and that this process is performed correctly. In wealth-management projects, new data does not generally overwrite existing data; instead, new data is added in a historical form at regular intervals—on an hourly or daily basis.

Given the various formats of financial market data, the ETL process enables the transformation and merging of data received from different sources, and the easier integration with requirements of algorithms and logics of a robo-advisor platform.

To use as few resources as possible, the ETL process ensures that there is no data redundancy, and significant focus within the process is on data quality and data monitoring.

How can we provide ETL for WealthTech?

Due to the high volumes of data involved, the ETL process may take certain amount of time and cause delays before the data can be leveraged by a WealthTech platform. Thus, the data value is limited, as it is not usable for some time. However, Big Data solutions, such as Hadoop and Spark, enable us to change this situation in terms of aggregating, transforming, sorting, and analyzing data inside the frameworks.

Let’s take a look at how this works:

  • Real-time financial data is aggregated by multiple data vendors in different formats.
  • This data is extracted in real time by a robo-advisor.
  • The data is aggregated and transformed in real time inside a Big Data repository (e.g., Hadoop or Spark).
  • The financial market transformations are processed by a robo-advisor platform to perform efficient customer-portfolio management—that is, to make decisions about asset allocation, diversification, or rebalancing.
  • As the robo-advisor platform performs certain actions, such as selling or purchasing assets, the outcomes of these actions, among others, are captured by data aggregators.

ETL in a WealthTech platfrom

Because of the enormous volumes of financial market data and the need to manage data movement, transformation, and analysis, Hadoop and Spark bring a tremendous advantage for wealth-management platforms.

However, both these data frameworks have advantages and disadvantages.

Apache Hadoop

Pros:

  • Hadoop can run on commodity hardware or in the cloud.
  • Hadoop has extensive functionality, including deriving data from any format and converting it into a single format; data warehousing; fraud detection; etc.
  • Hadoop has means to store large volumes of raw data without increasing expenses. Thus, if a robo-advisor changes its logics in the future, it can refer to the raw data.
  • Hadoop provides data processing and uses tools located on same servers as the data. Thus, processing operations are carried out faster.
  • By duplicating data and creating multiple copies, Hadoop protects against data failure and loss.
  • Clustering allows for fault tolerance.
  • Hadoop is scalable and gives the platform flexibility as the data lake grows.

Cons:

  • In Hadoop, the security measures are disabled by default. To secure sensitive data, specific measures should be taken.
  • Hadoop runs more slowly that Spark because Spark works in-memory.
  • Hadoop requires highly skilled specialists to operate it.

Apache Spark

Pros:

  • Spark is an in-memory platform, so it executes data processing more than 10 times faster than Hadoop does.
  • Fault tolerance is achieved using data replication to several nodes.
  • Spark supports many languages, such as Java, Scala, Python, etc.
  • Spark is highly configurable—it can thus be configured to work in the best way for each particular use case.
  • Spark supports graph execution, machine learning, and streaming functionalities.

Cons:

  • Spark takes up significant system resources.
  • Spark has no file management system, so it needs to rely on other platforms.
  • If not configured correctly, Spark may have performance issues in a production environment.

Conclusion

Robo-advisors are platforms with rapidly growing data pools. To gather financial market data from multiple sources, store the data in reliable and secure way, and enable further processing for the platform’s purposes, Big Data frameworks appear to be the best solution. Hadoop and Spark are extremely popular; however, in order to make the best of them, you need skilled IT developers with experience in working with particular frameworks. While Hadoop is used for its ability to store increasingly large amounts of data, Spark is also becoming popular due to its compatibility with machine learning.

As open-source frameworks, both Hadoop and Spark have been widely implemented. Nevertheless, today there are several alternatives, and you may find a framework that suits your platform better than those mentioned here.