Client-side good practices when building third party API integrations

Good practices, pains and other considerations from a client's perspective when building third party integrations

Image attribution: Photo by Claudio Schwarz on Unsplash

Here is a brain dump of things that I consider when I have to build a new integration with some third party API provider.

Notes are quite sketchy. If something needs a longer explanation, let me know!

Truths

  • Third party APIs are messy.
  • Third parties will become unavailable, both in small doses (a few requests “randomly” failing) and in big ones (a few minutes or hours of downtime).
  • SFTP is a perfectly acceptable way of integrating.

Practices

Initial implementation

  • Read the docs.
    • Do not just fiddle with the endpoints but read the actual docs.
  • Play with the API to understand behaviours. What if:
    • the URL is wrong? Does it return a 404?
    • the data queried does not exist? Empty result or 404?
    • we omit some query/body parameters?
    • we concurrently update the same piece of information?
    • we exceed the rate limit?
  • Estimate call rate and data volumes:
    • Is batching available?
    • Is pagination available?
    • Get a sense of the performance:
      • It might affect the implementation and business flow.
  • Document findings
    • Include support contact information and expectations.
  • Subscribe to the provider’s status page.
  • Find out what their change management process is:
    • Subscribe to whatever you need to subscribe to find out about changes.
    • Is it a newly built third party?
      • Try to get direct access to their technical team.
      • Expect loads of backwards incompatible changes.
      • Compared with mature products, there is a higher chance that the bug is on their side.
  • Consider a dark launch:
    • Put the integration in production so it is exercised, but do not use it in any client-facing functionality.
    • A minimum of monitoring (errors and performance) is required for this to be useful.
    • Also useful for collecting real example responses to use for additional testing (see the sketch after this list).
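
To make the dark launch idea more concrete, here is a minimal sketch in Java. ProviderClient, ExistingQuoteService and the Quote types are hypothetical names, and a real implementation would use your logging and metrics libraries instead of stdout:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

// Minimal dark-launch wrapper: the new integration is exercised in production,
// its errors and latency are visible, but its result is never shown to users.
class DarkLaunchedQuoteService {

    private final ProviderClient newProviderClient;      // the new third party integration
    private final ExistingQuoteService existingService;  // the current, trusted implementation
    private final Executor darkLaunchExecutor = Executors.newFixedThreadPool(2);

    DarkLaunchedQuoteService(ProviderClient newProviderClient, ExistingQuoteService existingService) {
        this.newProviderClient = newProviderClient;
        this.existingService = existingService;
    }

    Quote getQuote(QuoteRequest request) {
        // Fire and forget: exercise the new provider, record errors and latency,
        // and log real responses for later testing, but never use the result.
        CompletableFuture.runAsync(() -> {
            long start = System.nanoTime();
            try {
                String rawResponse = newProviderClient.quote(request);
                System.out.printf("dark-launch ok in %dms: %s%n",
                        (System.nanoTime() - start) / 1_000_000, rawResponse);
            } catch (Exception e) {
                System.err.println("dark-launch call failed: " + e);
            }
        }, darkLaunchExecutor);

        // The client-facing answer still comes from the existing implementation.
        return existingService.getQuote(request);
    }

    // Hypothetical collaborators, just enough to make the sketch compile.
    interface ProviderClient { String quote(QuoteRequest request) throws Exception; }
    interface ExistingQuoteService { Quote getQuote(QuoteRequest request); }
    record QuoteRequest(String productId) {}
    record Quote(String productId, long priceInCents) {}
}
```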

On the way to production

  • What is the acceptable business process when the integration fails?
    • Integration will fail.
    • Avoid, at all costs, making an integration mandatory for a key user flow.
    • What is the fallback mechanism? Default answer?
    • If there is a reasonable fallback business flow, consider recording which user flows were affected so that the proper actions can be retried/amended/reviewed/notified.
    • It is a business decision.
  • Add retries:
    • Think in terms of clock time, not number of retries.
    • Consider supporting a manual retry mechanism for your support folks.
    • See “user-flow vs background integrations” below.
  • Add a kill-switch:
    • Ideally in the hands of a PM.
    • Review “What is the acceptable business process when the integration fails”.
  • Always set timeouts in network calls:
    • In the case of the Apache HTTP client, at least the connection and read (socket) timeouts (see the sketch after this list).
    • Consider that while the thread is waiting for a response, it might be holding other resources (locks, db connections) hostage, which might affect unrelated requests.
    • Review “What is the acceptable business process when the integration fails”.
    • Note: when there is a read timeout while waiting for the server to respond, the client side does not know if the request was processed or not.
    • Note: if the client application crashes, any in-flight request to the provider ends up in an unknown state from the client’s point of view.
      • Consider a retry/recovery mechanism when the client application starts up.
  • If you can influence it, encourage the provider to implement idempotent APIs:
    • “At least once” semantics are way easier than “at most once”.
  • You might want to consider splitting one third party API into smaller independent integrations if:
    • Some endpoints are more critical than others for your business process.
    • The various endpoints have widely different latencies.
  • Monitoring:
    • Call rate, error rate, latency.
    • Logs:
      • All calls.
      • Request/Response body in the case of an error.
        • Careful with PII data.
      • Side note: client side monitoring is always better than server side monitoring, as the server might not see some requests if they never reach it, or it might miss monitoring data if it is struggling with load/network/crashing.
        • Server side monitoring is still required.
  • Set alerts:
    • Useful distinction between errors:
      • 4xx:
        • It’s our fault.
        • Most likely something we can fix on our side.
        • Daily/weekly report:
          • Daily the first few weeks.
          • Only alert if the error percentage is very high.
          • Very unlikely that a retry will help.
      • 5xx and timeouts/network errors:
        • Do not alert on each and every error:
          • Timeout and 5xx will happen and are normal.
        • Too many:
          • During the first few weeks, it might mean that you need to tweak your timeouts:
            • Dark launch!
          • Escalate to the provider team.
        • Retries will help.
      • In both cases, keep a close eye on things the first time you release an integration.
      • Remember that GraphQL needs additional error handling.
    • If you are doing out-of-hours escalation, make sure that the person contacted when there is an alert is the same person that thinks the alert deserves an out-of-hours call.
      • Most likely it is an issue with the provider, so it “only” needs to be escalated to their support.
      • Provide a dashboard for that alert that even a PM will understand.
      • Provide a phone contact to the third party support.
    • See “user-flow vs background integrations” below.
    • Consider doing load testing.
  • Security:
    • SSL.
    • IP allow-list.
    • Credentials rotation:
      • If it is not automated, figure out who to contact.
    • On startup, check that the credentials are valid.
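
As a reference, here is a minimal sketch of the timeout configuration mentioned above, assuming Apache HttpClient 4.x (the 5.x builders use slightly different names). The values are placeholders: base them on the latencies observed during the dark launch and on what the business flow can tolerate:

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Explicit timeouts for every network call, Apache HttpClient 4.x style.
class ProviderHttpClientFactory {

    static CloseableHttpClient create() {
        RequestConfig requestConfig = RequestConfig.custom()
                .setConnectTimeout(2_000)           // ms to establish the TCP connection
                .setSocketTimeout(5_000)            // ms of inactivity while waiting for data (read timeout)
                .setConnectionRequestTimeout(1_000) // ms to lease a connection from the pool
                .build();

        return HttpClients.custom()
                .setDefaultRequestConfig(requestConfig)
                .build();
    }
}
```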

Practices for user-flow vs background integrations

The main difference between an API integration that is part of a user flow and one that runs in some background process is that user-flow integrations require low latency, as users are unlikely to be willing to wait long.

Background integrations

  • Multiple retries:
    • Consider exponential backoff (see the sketch after this list).
    • Think in terms of clock time: for how many minutes or hours is the business process still meaningful? Or is it pointless if delayed more than X?
      • Business decision.
  • Alerts:
    • Think in terms of clock time: how long can this integration be down before somebody should panic?
      • Business decision.
    • Consider alerting before the process runs out of retries, so that once the issue is addressed the process will (hopefully) successfully retry.
  • Timeouts can be longer: up to minutes could be ok.
  • Batching is more likely to be useful.
  • No need for circuit-breakers.
  • Being rate-limited should result in traffic shaping: slow down and spread the work rather than failing.
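
A minimal sketch of what retrying in terms of clock time could look like. The backoff values are placeholders, and a production version would also add jitter and only retry the errors that are worth retrying:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Callable;

// Retries bounded by clock time, not by a number of retries: keep trying with
// exponential backoff until the result is no longer useful for the business.
class DeadlineRetry {

    static <T> T callWithRetries(Callable<T> call, Duration budget) throws Exception {
        Instant deadline = Instant.now().plus(budget);
        Duration backoff = Duration.ofSeconds(1);
        while (true) {
            try {
                return call.call();
            } catch (Exception e) {
                if (Instant.now().plus(backoff).isAfter(deadline)) {
                    // The business process is no longer meaningful: give up and alert.
                    throw e;
                }
                Thread.sleep(backoff.toMillis());
                backoff = backoff.multipliedBy(2); // exponential backoff; consider adding jitter
            }
        }
    }
}
```

A call site would pass the provider call and the clock-time budget agreed with the business, for example callWithRetries(() -> client.sendInvoice(invoice), Duration.ofHours(2)).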

User-flow integrations

  • You cannot wait long as there is a human looking at a spinning icon on the other side, so:
    • No more than one retry.
    • Short timeouts.
    • Think in terms of clock time: how long will the human be willing to wait for the result to appear on their screen before they think your app is broken?
    • Most of the time, a few seconds tops.
  • Connection pools are a must:
    • Unless your call rate is so low that it will make no difference.
    • Each integration should have its own connection pool:
      • Do not share connection pools between different integrations.
    • Configure the connection pool timeout and the TTL (see the connection pool sketch after this list).
  • Circuit breakers are a must.
  • Consider stale-while-revalidate + stale-while-error for cached authorization tokens and data:
    • Reduces the extra latency when the token/data is stale.
    • Better resilience: set the refresh period short enough that a transient error in the provider will not affect the user, as the refresh will be retried several times before the data becomes really stale (see the token cache sketch after this list).
  • Alerts:
    • Based on circuit breakers.
    • Think in terms of clock time:
      • For how long can a circuit breaker be open before somebody should panic?
  • Most PaaS platforms have a hard time limit for processing a request.
  • Consider:
    • Moving to a completely or mostly async IO model, to avoid thread starvation.
    • Java virtual threads!
    • Syncing data from the provider in a background process, storing it locally and serving it from the local store.
    • Splitting processing into two steps:
      1. Trigger the request to the provider on the first client request.
      2. Have the client poll periodically to check whether the response is ready.
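
A minimal sketch of a dedicated connection pool for a single integration, again assuming Apache HttpClient 4.x; the pool size, timeouts and TTL are placeholders:

```java
import java.util.concurrent.TimeUnit;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// One dedicated connection pool per integration, so a slow provider cannot
// exhaust the connections used by another integration.
class PaymentsProviderClientFactory {

    static CloseableHttpClient create() {
        PoolingHttpClientConnectionManager pool =
                new PoolingHttpClientConnectionManager(60, TimeUnit.SECONDS); // connection TTL
        pool.setMaxTotal(20);
        pool.setDefaultMaxPerRoute(20);

        RequestConfig requestConfig = RequestConfig.custom()
                .setConnectionRequestTimeout(200) // ms to lease a connection from this pool
                .setConnectTimeout(500)
                .setSocketTimeout(2_000)
                .build();

        return HttpClients.custom()
                .setConnectionManager(pool)
                .setDefaultRequestConfig(requestConfig)
                .build();
    }
}
```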
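
And a minimal sketch of stale-while-revalidate plus stale-while-error for a cached authorization token. fetchToken stands for whatever call obtains a new token from the provider, and the refresh period should be much shorter than the token’s real lifetime:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Stale-while-revalidate + stale-while-error for a cached authorization token.
// The token is refreshed in the background well before it really expires, so a
// transient provider error just means we keep serving the still-valid cached value.
class TokenCache {

    private record CachedToken(String value, Instant refreshAfter) {}

    private final Supplier<String> fetchToken;            // hypothetical call to the provider's auth endpoint
    private final Duration refreshPeriod;                  // much shorter than the real token lifetime
    private final AtomicReference<CachedToken> cached = new AtomicReference<>();
    private final AtomicBoolean refreshing = new AtomicBoolean(false);

    TokenCache(Supplier<String> fetchToken, Duration refreshPeriod) {
        this.fetchToken = fetchToken;
        this.refreshPeriod = refreshPeriod;
    }

    String token() {
        CachedToken current = cached.get();
        if (current == null) {
            // First call: we have no choice but to pay the latency.
            return refreshNow();
        }
        if (Instant.now().isAfter(current.refreshAfter()) && refreshing.compareAndSet(false, true)) {
            // Stale: refresh in the background, but return the cached value right away.
            CompletableFuture.runAsync(() -> {
                try {
                    refreshNow();
                } catch (RuntimeException e) {
                    // Stale-while-error: keep the cached token, the next call will retry the refresh.
                } finally {
                    refreshing.set(false);
                }
            });
        }
        return current.value();
    }

    private String refreshNow() {
        String token = fetchToken.get();
        cached.set(new CachedToken(token, Instant.now().plus(refreshPeriod)));
        return token;
    }
}
```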

Annoyances

A list of things that should never happen and do not make sense at all, but that you will need to live with.

  • Expect the integration to behave out of spec:
    • Be kind to the provider’s developers.
  • A 200 response does not mean a successful response:
    • Look at something inside the response body to confirm that it was a successful response (see the sketch after this list).
  • Health check endpoints tell you that the health check endpoint is working (or not):
    • Making this call does not guarantee the success of the following HTTP request:
      • Health check implementations usually just return a 200 if the API server is up, but do not check that all the downstream dependencies are up and running.
        • Doing so is usually very expensive.
    • Even if the health check endpoint does health checking of downstream servers and the downstream servers of the downstream servers, it is possible that by the time we send the second request, the API server or any of its dependencies is frozen or dead, or that there is some network issue on the path.
    • Making this call makes the error handling and logic more complex.
    • So, avoid.
    • See “Side note: client side monitoring is always better than server side monitoring”.
  • The provider’s dev environment is probably crap:
    • Consider using their staging in all pre-production environments.
    • Use their dev environment for initial development.
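
As a minimal sketch of checking the body and not just the status code; the status and errors fields are made up, since every provider spells success differently (GraphQL, for example, returns its errors inside a 200 response):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// A 200 status code is not enough: also check the body for the provider's own
// notion of success before treating the call as successful.
class ProviderResponseChecker {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    static JsonNode parseOrThrow(int statusCode, String body) throws Exception {
        if (statusCode != 200) {
            throw new IllegalStateException("HTTP " + statusCode + ": " + body);
        }
        JsonNode json = MAPPER.readTree(body);
        // Hypothetical provider convention: {"status": "OK", ...} on success,
        // {"status": "ERROR", "errors": [...]} on failure, always with HTTP 200.
        if (!"OK".equals(json.path("status").asText())) {
            throw new IllegalStateException("provider reported failure: " + json.path("errors"));
        }
        return json;
    }
}
```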

