
API Example: Working with Large Result Dataset

Example code for working with a large API result dataset

Written by Chris Patterson
Updated over 7 months ago

Question

How do I pull a large data set from the API?

Answer

In order to pull a large dataset from our API, you will need to refine your queries so that each returns fewer than 100,000 total records. To ensure stability for all customers, we limit the total results of any search to 100,000 records. You can retrieve these results in pages of up to 1,000 records per request.

If you need to retrieve more than 100,000 records overall, you must break your dataset into multiple smaller queries (for example, by filtering on date ranges or other criteria).

Recommended steps:

  1. Define your query filters for the entire dataset you want to retrieve.

  2. Check your result size to determine whether the query produces a result near or above the 100,000-record limit (a minimal check is sketched after this list).

  3. If necessary, break your filters into smaller chunks. For example, if you are pulling data for an entire year, chunk your searches into quarters or months.

  4. For each chunked query, page through the result set using pages of up to 1,000 records per request.
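
For step 2, one way to check the result size is to run the query once and compare total_count against max_result_window in the response. This is a minimal sketch using the same endpoint, headers, and response fields that appear in the full example further below.

import requests

api_key = "YOUR_API_KEY"
api_url = "https://webapp.forecasa.com/api/v1/transactions"
headers = {"x-api-key": api_key, "Accept": "application/json"}

# Same filters as the example below, but covering the whole year
params = {
    'page': 1,
    'page_size': 1000,
    'q[company_tags_in][]': "private_lender",
    'q[transaction_date_gteq]': '01/01/2024',
    'q[transaction_date_lteq]': '12/31/2024'
}

data = requests.get(api_url, params=params, headers=headers).json()

if data['total_count'] > data['max_result_window']:
    print("Result set exceeds the limit - narrow the date range or add more filters")
else:
    print(f"{data['total_count']} results - safe to page through as-is")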

Example

You want transactions filtered by:

  • company tags in: private_lender

  • transaction date: 1/1/2024-12/31/2024

This search results in a response dataset larger than 100,000 records.

To work around this, pull the data one month at a time. The example below retrieves the first monthly chunk (January); a sketch that loops over all twelve months follows the example.

import requests
import math

# API configuration
api_key = "YOUR_API_KEY"
api_url = "https://webapp.forecasa.com/api/v1/transactions"

# Headers with x-api-key for authentication
headers = {
"x-api-key": api_key,
"Accept": "application/json"
}

# Query parameters (maximum result size is 100k so chunk by transaction date)
params = {
    'page': 1,  # Start with first page
    'page_size': 1000,  # Number of results per page (max 1000)
    'q[s]': "transaction_date desc",  # Sort by transaction date (desc)
    'q[company_tags_in][]': "private_lender",  # Company tag filter
    'q[transaction_date_gteq]': '01/01/2024',  # Start date (first monthly chunk)
    'q[transaction_date_lteq]': '01/31/2024'  # End date (first monthly chunk)
}

# Step 1: Make initial request to get first page and total count
response = requests.get(api_url, params=params, headers=headers)
data = response.json()

# Extract data from first page
transactions = data['transactions']
total_count = data['total_count']
max_result_window = data['max_result_window']

# If total_count is greater than max_result_window, refine your filters so each chunk stays under the limit

# Step 2: Calculate total number of pages
total_pages = math.ceil(total_count / params['page_size'])

# Step 3: Process all remaining pages
for page in range(2, total_pages + 1):
    # Update page parameter
    params['page'] = page

    # Request the next page
    response = requests.get(api_url, params=params, headers=headers)
    page_data = response.json()

    # Add transactions from this page to our collection
    transactions.extend(page_data['transactions'])

# Now 'transactions' contains all items from all pages
print(f"Retrieved {len(transactions)} transactions")
