
Guide to evaluating market data APIs

July 04, 2024

Whether you're new to using market data APIs or on the search for a new data provider, it can be challenging to determine what a quality API looks like. In our new series, we'll share our perspective on important factors to consider when evaluating a market data API.

In this guide, we'll dive into some key questions to ask and common pitfalls to avoid when evaluating a market data provider. Read the entire guide, or jump to a relevant section via the table of contents below. We hope this helps you find the right market data provider for your needs.

  • Why referring to every venue as an exchange is a red flag
  • Point-in-time definitions and why they're important
  • When technical indicators are a red flag—and when they're necessary
  • Why using the same code for backtesting and live trading matters

In this section, we cover what we like to call the exchange smell test: when every venue is called an "exchange."

We previously had a customer point out an API he used that had everything either prefixed with "Exg" or suffixed with "Exch."

This naming is inaccurate for several reasons:

  1. Millions of active fixed-income instruments, such as corporate bonds, municipal bonds, and cash treasuries, are not traded on exchanges.
  2. A majority of FX trading takes place bilaterally between banks or on dealer markets and ECNs like EBS, FXall, and Currenex.
  3. There are whole asset classes like forwards that, by definition, don't trade on a centralized exchange.
  4. The notional volume of derivatives trading dwarfs that of equities. A significant amount of derivatives trading is off-exchange, e.g., swaps on SEFs.
  5. ATSes and off-exchange trading make up almost half of US equities ADV.
  6. Non-exchange venues have important microstructural value for price discovery, e.g., some of them allow you to identify your counterparties. This lets you glean what percentage of contras execute against a particular market-maker and how sharp that market-maker's order flow is, follow a more sophisticated market-maker on the rollover, or know whether the same counterparty has been picking you off repeatedly.

When an API and its documentation fail the exchange smell test, it's a hint that the API designers have limited or no institutional trading experience and aren't prepared to extend the API beyond stock exchanges or a handful of retail trading venues. You'll most likely spend much of your time as your API provider's bug hunter, working with support engineers who aren't knowledgeable about market microstructure.

If you're writing a financial API, whether it's for internal or customer use, it's better to use the term "venue" unless you're certain you'll only ever deal with exchanges. This may seem pedantic, but the naming creeps fast into downstream code, pair programming sessions, research meetings, and trading desk conversations. Before you know it, you'll start saying things like "FINRA TRF exchange" or "UBS exchange."
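
Here's a minimal sketch of this convention in Python. The MIC codes are real, but the Venue enum and the get_quotes endpoint are purely illustrative:

```python
from enum import Enum

class Venue(str, Enum):
    """Trading venues, keyed by MIC code. Not all of them are exchanges."""
    XNAS = "XNAS"  # Nasdaq, an actual exchange
    FINN = "FINN"  # FINRA/Nasdaq TRF Carteret, a trade reporting facility
    GLBX = "GLBX"  # CME Globex

def get_quotes(symbol: str, venue: Venue):
    """Hypothetical endpoint: "venue" stays accurate even when
    non-exchange venues (ATSes, SEFs, TRFs) are added later."""
    ...
```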

Check out our article on venues and publishers for more information on how we use these terms at Databento. You can also check out our docs to learn more about our market data APIs.

Many APIs provide instrument definitions either as a static endpoint where you can get the data for a specific day or as a flat file. Point-in-time definitions, on the other hand, give the user a stream of messages as they're published by the venue.

1. Venues use point-in-time definitions

While flat files were at some point the standard way in which venues published their instrument data, most venues today have switched to point-in-time definitions, with the flat file as a fallback.

Futures venues such as CME and Eurex use point-in-time definitions to create new complex instruments in the middle of the trading day, which don't get reflected on the flat file. CME specifically discourages users from using the flat file since it doesn't account for intraday updates to instruments.

2. Without them, you risk missing out on an IPO

Providers that don't offer point-in-time definitions usually populate their endpoint by polling the venues' flat files a few times per day to load new instruments.

The issue is that when you load the venue's flat file for the day, the new instruments may not have been added yet, so you won't have data for the newcomers and won't be able to trade them. This is a recurring issue in the industry: traders want to get in on an IPO at the open, but their system doesn't recognize the new instrument yet.

Point-in-time instrument definitions solve this problem by ensuring that as soon as the venue publishes data for the instrument, you'll receive it, and your system will be able to trade it.
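
As a rough sketch of what this looks like with our Python client, you subscribe to the definition schema and handle instruments as the venue publishes them. The parameters here are illustrative (shown for CME Globex; the same pattern applies to an equities feed if you care about IPOs); see our docs for the exact interface:

```python
import databento as db

# Illustrative sketch: subscribe to point-in-time definitions on CME Globex.
live = db.Live(key="YOUR_API_KEY")
live.subscribe(
    dataset="GLBX.MDP3",
    schema="definition",
    stype_in="parent",
    symbols="ES.FUT",
)

for record in live:
    # Definition records arrive as the venue publishes them, so instruments
    # created intraday show up without waiting for the next flat file.
    print(record)
```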

3. Point-in-time instrument definitions behave the same for historical and real-time data

When you're backtesting a model against an API using a reference data flat file, it's much harder to account for intraday updates since you usually don't know when the file was updated on the source and when your system pulled the update.

This can lead to look-ahead bias, because your model has access to data that was only published later in the day. Point-in-time instrument definitions, on the other hand, behave in the same way for historical and real-time data.
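
Here's a similarly illustrative sketch of pulling point-in-time definitions over a backtest window with our historical API (again, parameter names are indicative; see our docs for the exact interface):

```python
import databento as db

# Illustrative sketch: request definition records over a backtest window.
client = db.Historical("YOUR_API_KEY")
definitions = client.timeseries.get_range(
    dataset="GLBX.MDP3",
    schema="definition",
    symbols="ES.FUT",
    stype_in="parent",
    start="2024-06-03",
    end="2024-06-05",
)

# Each record carries the timestamp at which it was published, so a backtest
# can replay definition updates in event order alongside trades and quotes.
print(definitions.to_df().head())
```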

At Databento, much of our team comes from a trading background, so we've seen all of these problems and more firsthand. We've made sure that Databento natively supports point-in-time definitions for both our historical and real-time APIs to address these issues, which are common among other API providers.

To learn more about our point-in-time definitions and view examples of how they're used, you can check out our docs.

In this section, we dig into when technical indicators are a red flag, and when they're necessary. To identify the red flags, we use what we call the technical indicator smell test: when your market data API has extensive support for technical indicators and derived features.

Most technical indicators have an insignificant correlation with future returns, which makes them impractical for developing profitable trading strategies.

It's usually better to compute such derived values on the client side. This may be controversial, but we're quite opinionated about it at Databento. Even popular values like Greeks and VWAPs are often better computed on the client side; Greeks calculations depend heavily on the model and inputs used, and top trading firms rarely use third-party Greeks.
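
As a simple illustration of what client-side computation looks like, here's a VWAP computed directly from raw trades. This is a generic sketch assuming a pandas DataFrame with price and size columns, not tied to any particular API:

```python
import pandas as pd

def vwap(trades: pd.DataFrame) -> float:
    """Volume-weighted average price from raw trades with 'price' and 'size' columns."""
    return float((trades["price"] * trades["size"]).sum() / trades["size"].sum())

# Example usage with made-up trade prints
trades = pd.DataFrame({"price": [100.0, 100.5, 100.25], "size": [200, 100, 300]})
print(vwap(trades))  # weighted toward the larger prints
```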

Good APIs usually have a small surface with a handful of orthogonal methods. While technical indicators may serve a specific audience, their purpose is so far downstream that including them bloats the API for most users, who don't need them. Here are a few of the problems:

  • Poor model fit: Most technical indicators have functional forms that result in poor model fit, e.g., truncation, censoring, non-linearity, discretization, and dependence on subsampled time-space.
  • Expensive to compute online: Many technical indicators have functional forms that are expensive to compute online, e.g., they require a sliding window of memory or cost linear time in the lookback period to compute (see the sketch after this list).
  • Many venues have derived data policies: Sometimes, these indicators are published without complying with the venue's derived data policies.
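
To make the online-computation point above concrete, here's a generic sketch contrasting a simple moving average, which needs a sliding window of the last n prices, with an exponential moving average, which needs only constant state per update:

```python
from collections import deque

class RollingSMA:
    """Simple moving average: keeps a sliding window and sums it on every update."""
    def __init__(self, n: int):
        self.window = deque(maxlen=n)

    def update(self, price: float) -> float:
        self.window.append(price)
        return sum(self.window) / len(self.window)  # O(n) work, O(n) memory

class EMA:
    """Exponential moving average: constant memory and O(1) work per update."""
    def __init__(self, alpha: float):
        self.alpha = alpha
        self.value = None

    def update(self, price: float) -> float:
        self.value = price if self.value is None else (
            self.alpha * price + (1 - self.alpha) * self.value
        )
        return self.value
```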

Although we've pointed out many reasons not to support technical indicators, there are limited situations in which we do recommend them:

  • I/O is always a limiting factor when providing a data-intensive API, so it's often better to push computation closer to the source, up the I/O or memory hierarchy, from client to server, etc. The same principle underpins Apache Spark and edge computing. If the API is used by a web application that serves many users, it could be cheaper to compute these on the server side.
  • They're easy to communicate as starter examples in your documentation; many of these technical indicators are widely known and have simple functional forms.

You can view our docs page for examples of calculating common technical indicators on Databento, like VWAP and RSI or TICK and TRIN.

One of the most essential features of a market data API for a systematic trading platform is using the same code for backtesting and live trading. Market data vendors often advertise with buzzwords like "data quality," "rows of data," "reliable," and "fast," but don't address the things that actually improve trading workflows.

Many vendors are missing market replay functionality, have different formats for historical flat files and real-time data, and have differences between their historical and real-time interfaces.

Integrating their APIs propagates these problems to your platform and makes it hard to design effective interfaces that harmonize backtesting and production.

Using the same data format for historical and real-time data allows your strategy to be implemented only once. It doesn't have to be ported from Python/MATLAB/R to C++. The strategy calls the same interfaces, and the same configs and meta-configs are reused for production deployment. These features speed up your production cycle.
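
Here's a rough sketch of the kind of interface this enables. The Strategy protocol and run loop are hypothetical, but the idea is that the same strategy object is driven by either a historical replay iterator or a live feed:

```python
from typing import Iterable, Protocol

class Strategy(Protocol):
    def on_event(self, event) -> None:
        """Handle one market data record (trade, quote, definition, ...)."""
        ...

def run(strategy: Strategy, events: Iterable) -> None:
    """Drive a strategy from any event source.

    `events` can be a historical replay iterator in a backtest or a live
    client in production; the strategy logic is written exactly once.
    """
    for event in events:
        strategy.on_event(event)
```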

Additionally, it reduces second-guessing about whether worse out-of-sample (OOS) performance is due to model overfit, covariate shift, a software bug, or an implementation error in your simulator or ported strategy.

It also saves valuable time for quantitative researchers, developers, and engineers, whose time is more expensive than your infrastructure. If your backtesting engine is slow, you can always throw more I/O or compute at the problem, but scaling human resources is much more challenging.

At Databento, we make it easy to integrate and design an effective interface by having a zero-copy binary format that's identical for both historical and real-time data, exposing a market replay method in our clients that emulates real-time, and supporting intraday replay in our real-time API so you can burn in your signals directly without stitching historical data to real-time. We also deliberately designed our historical and live APIs to look identical.

Learn more about our historical and live market data APIs on our docs page.