Concurrency In Systems Integration

Concurrency and backpressure management are frequent concerns in systems integration

Typical solutions involve clusters of complex systems to maintain redundancy

Learn how WebSemaphore resolves multiple associated concerns


Concurrent access to resources presents a frequent challenge in systems integration. WebSemaphore is a scalable, serverless solution that aims to address a niche set of concerns related to concurrency and time-optimized allocation of resources.

Concurrency in the systems integration context

More often than not, every system or API we need to communicate with imposes some concurrency limit. Examples include the maximum number of simultaneous connections to a database, the actual performance capacity of a web server, or artificial limits imposed by a provider based on your usage plan.

While some products and frameworks provide tools to manage concurrency, they better serve customers who are already using a relevant part of the stack.

Some examples are HashiCorp Consul, which requires setting up a cluster and provides no queueing mechanism, or Redis, which also requires a cluster and dealing with special primitives. If the tooling already exists and is maintained, works well within an enterprise or department, and can address the issue, enterprise architects will in most cases choose the shortest path and use their default tool.

The premise of WebSemaphore is that none of the existing solutions are focused on concurrency and optimization as the primary concern, in particular none address it in a cloud-first, SaaS/IaC manner.

Microservices-based architectures in all their variety are here to stay, with serverless technologies contributing to and accelerating the trend, and they are increasingly used by companies of all sizes. As a result, systems in different parts of organizations evolve independently, essentially resembling startups in terms of their technological independence. Due to varying needs and competencies, they often use different stacks and platforms, interconnected in complex integration graphs.

A sample integration graph

Fig 1. An actual integration graph demonstrating 20 systems with 46 integrations in only 2 enterprise projects. The blurred items are confidential.

One of the outcomes is that a language- or environment-specific implementation and its associated complexity can stand in the way of adopting such solutions across the different stacks.

We believe IaC microservices in the style of modern clouds are the way of the future, as they abstract away the technology and implementation details, instead providing a focused set of capabilities that can be easily integrated into any solution.

WebSemaphore fits the definition with its specific, restricted scope and stack independence, and contributes its functionality to the overall mesh of IaC concepts in the cloud/serverless ecosystem.


The value proposition of WebSemaphore is to provide an IaC-style, serverless, zero-setup solution that enables developers to seamlessly solve concurrency challenges.

Due to its minimalistic design, it can upgrade an existing naive/retry-based flow to an async flow by applying only a handful of implementation changes, along the lines of Fig 3 as shown below.

When do we need WebSemaphore?

Consider a flow F where consumer C wants to access a limited resource R from multiple processes. The processes can either compete for the resource by trial and error, or queue ordered by some metric, most commonly arrival time, which is identical to FIFO ordering.

Fig 2. Flow F. Consumer C invokes Provider R directly. Note that the Consumer and Provider may or may not be in different organizations.

No queueing mechanism / Sync

In the absence of a control mechanism on the consumer side, the next best way to handle capacity failures is to retry with some strategy, exponential backoff being one of the most popular choices.

It’s easy to see how transactions may be lost with this approach, especially during activity peaks. This may be the goal in some scenarios, such as locking a unique item in an e-shop while it’s in someone’s active basket: we don’t want anyone else to be able to lock it unless the current order expires. However, it fits poorly with the order processing flow. You don’t really want your customer to click “retry” during payment; a much better strategy, when applicable, is to accept the order and process it when capacity is available.
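To make the failure mode concrete, here is a minimal sketch of the retry-with-backoff pattern. The provider, error type, and function names are illustrative, not part of any real API; note how the transaction is simply lost once the retry budget runs out.

```python
import random
import time

class CapacityError(Exception):
    """Raised when the provider rejects a request for being over capacity."""

def call_with_backoff(call, request, max_attempts=5, base_delay=0.01):
    """Retry a capacity-limited call with exponential backoff plus jitter.
    Returns the result, or None if every attempt fails (transaction lost)."""
    for attempt in range(max_attempts):
        try:
            return call(request)
        except CapacityError:
            # 0.01s, 0.02s, 0.04s, ... plus jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    return None  # the peak outlasted our retries: transaction lost

# A simulated provider that is over capacity for the first two calls
attempts = {"count": 0}
def flaky_provider(req):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise CapacityError()
    return f"processed {req}"
```

During a sustained peak the provider would keep raising `CapacityError` past the last attempt, and the `None` return is exactly the lost transaction discussed above.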

Fig 3. Access pattern without a queueing mechanism. The system rejects overcapacity traffic during peaks, and is idle during the lows. For a live, configurable simulation check out

From the consumer’s perspective, even in cases where retries are undesirable, they may still be chosen as an intermediate solution due to the relative complexity of the alternative. One might even get away with this as long as there is excess capacity most of the time.

From the provider’s perspective, the gaps in processing translate into lost business opportunity and reduced return on investment. Moreover, if the system has no awareness of its own capacity, excessive requests can jeopardize the requests that are currently being processed, similar to what happens in a DDoS attack.

With a queueing mechanism / Async

The async approach breaks the flow into (1) an initiator step and (2) a processor step. The initiator invokes the processor indirectly via an intermediate mechanism that controls the throughput.

Fig 4. Asynchronous Flow F. Consumer C is invoking Provider R indirectly

To allow accepting requests regardless of capacity, a queue is needed. However, a queue alone does nothing to satisfy the concurrency control requirement, and that’s where the need for an atomic, consistent counter emerges.
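The combination of a queue for ingress plus an atomic counter for concurrency can be sketched in a few lines. This is a local, in-process illustration of the principle (WebSemaphore provides the distributed equivalent); the burst size and limit are arbitrary.

```python
import queue
import threading

MAX_CONCURRENCY = 2                # the provider's concurrency limit
semaphore = threading.Semaphore(MAX_CONCURRENCY)  # atomic, consistent counter
requests = queue.Queue()           # accepts ingress regardless of capacity
results = []

def worker():
    while True:
        req = requests.get()
        if req is None:            # poison pill: shut this worker down
            requests.task_done()
            break
        with semaphore:            # never more than MAX_CONCURRENCY in flight
            results.append(f"processed {req}")
        requests.task_done()

# Ingress: a burst far above the concurrency limit; nothing is rejected
for i in range(10):
    requests.put(i)

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
requests.join()                    # wait until the backlog is drained
for _ in workers:
    requests.put(None)
for w in workers:
    w.join()
```

The queue alone would let all four workers hit the provider at once; the semaphore alone would reject the burst. Only together do they accept everything while keeping at most two requests in flight.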

Below is the chart for running the data from Fig 3 through an asynchronous semaphore.

Fig 5. Async request simulation - semaphore performance over time. Note how lockValue stays high long after the traffic peaks and how eventually peaks of waiting messages are exhausted.

Is async better than sync?

Like in many “which is better” debates, the answer depends on the situation, here on the use case. But if you are using sync where async would be reasonable, you are missing out on business opportunities. How much you are missing depends on the traffic patterns. As an illustration, here are the execution totals for the simulations presented in Fig 3 and Fig 5 respectively:

Fig 6. Sync (left) and Async (right) request simulation - totals over time. The final result on the left side is 42.9% rejections (note this strongly depends on the actual environment and flow). There are 0 rejections on the right. Check our simulation for an interactive, customizable comparison.

If you find yourself implementing a variation of the async mode and your tools don’t cut it for you out of the box, consider using WebSemaphore.

WebSemaphore doesn’t force you to choose sync over async: both modes can be used interchangeably on the same semaphore without breaking consistency. Thus, for example, existing solutions can stay with the sync model and phase the migration to async mode across the various critical paths in the application.

Some queues are more equal

There comes a time when the solution requires predictable concurrency management. Armed with a queueing solution lying around the project, the adventurous engineer starts digging into how to achieve this. Another astute developer realizes that we are in fact in consensus land and remembers that ZooKeeper is used elsewhere in the organization. Now she needs to figure out how to use its primitives to achieve the result; find a way to keep persistent connections, which is especially challenging in a serverless model; and finally, make sure the cluster is sufficient for the traffic. A third developer reads an article about how this can be achieved with Redis or DynamoDB.

Congratulations, you (or your engineers) have just gotten completely distracted from your use case - see you in a few weeks or months. And it's not about skills or competencies - rather, it's about focus.

Assuming they had the time, budget and skill, what would they build?

Let’s list the desired features based on the discussion above. We would like a self-contained, scalable service that installs a queue in front of a semaphore, making them work as a synchronized unit, fulfilling the following:


Concurrency Control:

  • Limit concurrent throughput
  • Allow lock acquisition for an arbitrary duration
  • Maximize capacity usage over time

Queue Management:

  • Pause processing while keeping ingress
  • Multiplexable (think SQS FIFO GroupId)

Failure Handling and Recovery:

  • Recover from failure, including traffic reprocessing and redriving

Setup and Integration:

  • Easily reconfigure the queue, on-the-fly where it makes sense
  • Minimal to no setup
  • A simple API that feels native in the embedding code
  • Stack-independent
  • Scalability: effortless, preferably serverless
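Several of these requirements can be captured in a small in-memory model: a FIFO backlog fronting a counting semaphore, with pause/resume and on-the-fly limit changes. All names here are illustrative stand-ins for the behavior, not WebSemaphore's actual API.

```python
from collections import deque

class PausableSemaphoreQueue:
    """In-memory model of the wishlist above: a queue fronting a counting
    semaphore, with pause/resume and on-the-fly reconfiguration."""

    def __init__(self, limit):
        self.limit = limit          # max concurrent locks
        self.active = 0             # locks currently held
        self.waiting = deque()      # FIFO backlog; ingress is never rejected
        self.paused = False

    def submit(self, message):
        """Accept a message unconditionally; return any messages started now."""
        self.waiting.append(message)
        return self._drain()

    def release(self):
        """A processor finished; free a lock and start waiting work."""
        self.active -= 1
        return self._drain()

    def pause(self):
        """Hold processing (e.g. during an outage) while ingress continues."""
        self.paused = True

    def resume(self):
        self.paused = False
        return self._drain()

    def set_limit(self, limit):
        """Reconfigure concurrency on the fly, e.g. after scaling out."""
        self.limit = limit
        return self._drain()

    def _drain(self):
        started = []
        while not self.paused and self.waiting and self.active < self.limit:
            self.active += 1
            started.append(self.waiting.popleft())
        return started
```

The hard part, of course, is making `active` an atomic, consistent counter across distributed, serverless callers rather than a field on a local object; that is precisely the gap the rest of this section discusses.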

The range of queueing mechanisms is wide, and there are also a few respected, self-standing products that provide semaphores within a mostly unrelated feature set.

For example, the few queueing solutions that allow suspension are not serverless, and the ones that are serverless will only let messages wait for a certain maximum amount of time. Systems that include semaphore/signaling mechanisms, in their turn, mostly don’t provide queues or require their clients (or their proxies) to maintain a persistent connection.

Some larger integration products may include similar capabilities, but come with the overhead of committing to an overly complex solution. This means a higher license fee for the extra features, a learning curve, and at least a partial disruption of the stack (tools, language, often Java). Unless it’s already used where you are, this is comparable to buying and then learning to use a cannon to shoot a fly. Additionally, many of the established products in this space are better suited for ETL and scheduled jobs than for near real-time integrations. Finally, they target complex integration environments at the organization level, whereas you may be solving a specific challenge where the integration is but one of many implementation details.

While the landscape of available solutions is diverse, few if any of the existing offerings explicitly target concurrency control and throughput optimization in near real-time, distributed communication as their primary area of focus. This highlights an opportunity for a solution like WebSemaphore to address this specific need.

The unknown unknowns

Concurrency issues are not typically stated as initial business/project requirements unless they represent key features of a product. Instead, they surface during analysis iterations. How soon depends on the architect’s level of experience, the time available to collect and analyze project environment data, and the technological readiness and policies of the organization.

To see how a simple, barely technical set of requirements unfolds into a distributed concurrency problem in a global project, see my next article coming up next week.

WebSemaphore is intended to organically merge into an existing/developing solution in the form of a few API calls, and cause minimum disruption to the existing development flow.

What concerns does WebSemaphore address?

Below is a summary of the benefits of using the flow described above.

Flow consistency: Concurrency limiting/isolation of long flows (such as those including: AWS State machine executions, long computations, physical devices or offline activities).

Data consistency: FIFO execution preserving order of events for event-based architectures

Concurrency control: Adhering to concurrency limits implied by process, hardware or provider limits.

Failure tolerance / Disaster Recovery: The service will handle temporary outages and allow suspension of processing for failing streams until resolution. Ingress of inbound events for such streams will continue. Processing is simple to restart once the issue is resolved. Redirecting traffic to a functioning destination is an upcoming feature.

Near-real-time and optimal resource utilization: processing capacity must not sit idle. We should process almost immediately most of the time.

Dynamic capacity management: Where scaling is possible (such as with EC2 autoscaling), it may take some time to provision additional processing units. Imagine that such provisioning takes even longer due to some online or even human-in-the-loop process such as an approval. In these cases WebSemaphore is the ideal tool to provide the required elasticity while the extra capacity is stood up. Once the extra units are available, a simple configuration call is sufficient for WebSemaphore to start processing at a higher rate and catch up with the backlog.
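The catch-up arithmetic behind this elasticity is simple to sketch. The rates and durations below are made-up illustration values, not measurements of any real system.

```python
def drain_time(backlog, arrivals_per_sec, capacity_per_sec):
    """Seconds until a backlog is exhausted at steady rates, or None if
    capacity never catches up with the arrivals."""
    net_rate = capacity_per_sec - arrivals_per_sec
    if net_rate <= 0:
        return None                # backlog only grows without extra capacity
    return backlog / net_rate

# While provisioning: 25 req/s arrive but only 10 req/s can be processed,
# so the queue grows at 15 req/s; after 40 s of provisioning delay the
# backlog is 600 messages, none of them rejected.
backlog = (25 - 10) * 40

# Scale out and raise the semaphore limit to 40 req/s of capacity:
catch_up = drain_time(backlog, arrivals_per_sec=25, capacity_per_sec=40)
```

With these numbers the backlog drains at a net 15 req/s, so the system catches up 40 seconds after the new capacity comes online, with the semaphore limit raised by a single configuration change.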


In this article we looked at some specialized challenges in enterprise systems integration and the unique feature set WebSemaphore offers to address them, concluding with a concise summary of the concerns WebSemaphore aims to address.

WebSemaphore is in beta and we are actively looking for pilot customers to try out all of the features. As an early partner you will have the exclusive opportunity to influence product priorities as we make it fit your needs.

Looking forward to hearing from you at

See also
