Using the POST method for HTTP search queries

When designing RESTful services where data is queried, we tend to map this querying functionality to an endpoint that uses an HTTP GET method. This is good because GET is specifically designed for data retrieval.

However, there are scenarios where this isn’t a sufficient solution. For example, if the search query becomes too complex (or too long) to reasonably specify in the URI, we may want to send a request body instead, putting the query inside the payload. The HTTP specification doesn’t define any semantics for a body on a GET request though:

Not all requests have one: requests fetching resources, like GET, HEAD, DELETE, or OPTIONS, usually don’t need one.

so it is unreliable to assume that you can send data in this way and have it interpreted correctly by servers.

Another situation is when the query contains data that you don’t want to expose in the URI: if there is personally identifiable information (e.g. a person’s home address, which you are using for an insurance query), then you definitely don’t want to make yourself liable in the event that a rogue attacker gets access to your server logs.
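
To see why this matters, here is a small sketch of what a typical access log line would contain in each case. The /search path, the field names and the postcode value are made up for illustration, and the exact log format depends on your server.

```python
from urllib.parse import urlencode

# Hypothetical insurance query containing a home address.
query = {"postcode": "AB1 2CD", "building_number": 42, "car_make": "toyota"}


def as_get_request_line(path: str, params: dict) -> str:
    """Request line for a GET: the whole query is part of the URI."""
    return f"GET {path}?{urlencode(params)} HTTP/1.1"


def as_post_request_line(path: str) -> str:
    """Request line for a POST: the query travels in the body instead."""
    return f"POST {path} HTTP/1.1"


print(as_get_request_line("/search", query))
print(as_post_request_line("/queries"))
```

The first line carries the address wherever the request line gets recorded; the second does not, because request bodies are usually not written to access logs.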

So what can we do?

Just POST?

Your first thought might be to just use POST: it’s designed specifically to allow you to send data in the request body, so why not use it?

The problem is that we lose many of the benefits that GET provided for us in HTTP’s semantics:

  • Your search query is no longer addressable (e.g. you can’t bookmark the results). This also doesn’t play well with browsers in terms of moving backwards or forwards through a user’s browser history.
  • POST is not safe, meaning that it can change the state of the server. This doesn’t seem right for the semantics of querying: I’m requesting data in a read-only operation, so why should the server’s state change?
  • What happens when you want to provide additional filtering, such as paginated results? With a GET you’d supply a query parameter for the page, but this makes less sense in the context of a POST. You can still do this, but semantically it’s awkward, and the already non-addressable query gets an even stranger-looking URI.
    You can opt for putting it into the request payload instead, but that is forcing a square peg into a round hole: you’d be designing around a shortcoming that came out of choosing an inappropriate HTTP method.

This is a very RPC-centric solution: it thinks of "search" as being a function that gets executed over a network, rather than a resource that gets accessed via said network.

So POST first … and then GET later

If we want to think of resources, we instead have 2 endpoints:

  1. A POST /queries endpoint which creates our search query, returning a 201 Created status code and query ID for finding the results. In the request payload you specify the query parameters.

    // Status: 201 Created
    // Content-Type: application/json
    // Content-Location: /queries/c2f3f217-03b0-4ced-953d-071adeaffbb8/results
    {
        "query_id": "c2f3f217-03b0-4ced-953d-071adeaffbb8",
        "meta": {
            "links": [
                {
                    "href": "/queries/c2f3f217-03b0-4ced-953d-071adeaffbb8/results",
                    "method": "GET",
                    "rel": "results"
                }
            ]
        }
    }

    We’re also providing metadata for finding the resource in the response, so that the consumer can follow the response to get there.

  2. A GET /queries/{query_id}/results endpoint which we hit for retrieving the results of the search:

    // Status: 200 OK
    // Content-Type: application/json
    {
        "results": [
            // ...
        ],
        "meta": {
            "status": "COMPLETE",
            "total": 10,
            "links": [
                {
                    "href": "/queries/c2f3f217-03b0-4ced-953d-071adeaffbb8/results",
                    "method": "GET",
                    "rel": "self"
                }
            ]
        }
    }
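
To make the two endpoints concrete, here is a minimal, framework-free sketch in Python. The in-memory QUERIES dictionary, the run_search placeholder and the handler names are assumptions made for illustration; a real service would sit behind a web framework and use proper storage.

```python
import uuid

# In-memory stand-in for real storage: query_id -> stored record.
QUERIES: dict[str, dict] = {}


def run_search(query: dict) -> list:
    """Hypothetical placeholder for the real search logic."""
    return []


def create_query(query: dict) -> tuple[int, dict, dict]:
    """Handle POST /queries: store the query and point at its results."""
    query_id = str(uuid.uuid4())
    results_href = f"/queries/{query_id}/results"
    QUERIES[query_id] = {
        "query": query,
        "status": "COMPLETE",
        "results": run_search(query),
    }
    headers = {
        "Content-Type": "application/json",
        "Content-Location": results_href,
    }
    body = {
        "query_id": query_id,
        "meta": {
            "links": [{"href": results_href, "method": "GET", "rel": "results"}]
        },
    }
    return 201, headers, body


def get_results(query_id: str) -> tuple[int, dict]:
    """Handle GET /queries/{query_id}/results."""
    record = QUERIES.get(query_id)
    if record is None:
        return 404, {"error": "unknown query"}
    href = f"/queries/{query_id}/results"
    body = {
        "results": record["results"],
        "meta": {
            "status": record["status"],
            "total": len(record["results"]),
            "links": [{"href": href, "method": "GET", "rel": "self"}],
        },
    }
    return 200, body
```

Each handler returns the status code and payload rather than writing to a socket, which keeps the sketch independent of any particular framework.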

This provides us with a number of interesting benefits:

  1. We get to use POST for determining whatever payload structure we want for the query itself. So you can choose to send plain text, JSON, XML, etc… – you’re not constrained by query parameters any more.
  2. Since the final resource is fully addressable, we retain the benefits of using GET for the original search query!
  3. Notice that we have a meta.status property for describing the status of the GET query. This indicates that we can have an asynchronous search query.
    Suppose that you needed to do multiple requests behind the scenes before aggregating the results, and that this process was slow. Rather than keeping the connection with the client open for the whole request, your initial POST gives the client a resource that they can poll until the results are ready.
    You complete the task in the background as a side-effect of their POST. While it’s still computing, you leave the status as PENDING, and once completed you mark it as COMPLETE. If there was an error, give a FAILED state, and add some metadata explaining what went wrong.
    This is great for communicating status updates in a user interface – a consumer can have a loading bar which fills up as the query progresses through different states; they can be shown certain results early before the full query is complete, etc…
  4. Additional filtering/sorting can still be done using query parameters on the GET (e.g. GET /queries/c2f3f217-03b0-4ced-953d-071adeaffbb8/results?provider=1&page=3). In fact, if you’re caching the result set on the server side, these sub-queries cost less server load than having only the POST (and re-running the whole query with slightly different parameters).
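
The PENDING, COMPLETE and FAILED lifecycle from point 3 could be sketched like this. It assumes a plain background thread and an in-memory dictionary; slow_search and start_query are hypothetical names, and a production system would more likely use a job queue.

```python
import threading

# In-memory stand-in for real storage: query_id -> stored record.
QUERIES: dict[str, dict] = {}


def slow_search(query: dict) -> list:
    """Hypothetical stand-in for the slow aggregation step."""
    if not query:
        raise ValueError("empty query")
    return [{"provider": 1}, {"provider": 2}]


def _run(query_id: str) -> None:
    """Background worker: run the search and record the outcome."""
    record = QUERIES[query_id]
    try:
        # Store results before flipping the status, so that a COMPLETE
        # record always has its results in place.
        record["results"] = slow_search(record["query"])
        record["status"] = "COMPLETE"
    except Exception as exc:
        record["status"] = "FAILED"
        record["error"] = str(exc)


def start_query(query_id: str, query: dict) -> threading.Thread:
    """Called from the POST handler: register the query, start the work."""
    QUERIES[query_id] = {"query": query, "status": "PENDING", "results": []}
    worker = threading.Thread(target=_run, args=(query_id,), daemon=True)
    worker.start()
    return worker
```

The POST handler can return as soon as start_query has been called; the GET handler only ever reads the stored status and results.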

Considerations

But as the saying goes, there is no free lunch. There are some important tradeoffs to consider when implementing this type of solution.

Ease-of-use

A consumer of your API may find that using multiple endpoints to get a set of results is more difficult than they find worthwhile. This is a fair consideration: at the end of the day we design our APIs to be consumed by something or someone, and APIs shouldn’t be more difficult to integrate than is necessary just because of a dogmatic adherence to REST.

It is on you as the designer of the API to then make your interface as easy as possible to consume. HTTP provides helpful places for providing metadata in its specification (the Content-Location header is a good example), so leverage the existing specification where possible. Beyond this, you should be giving your users as much information as is necessary to complete the required steps for their intended use. Conceptually, your consumer will just need to hit the POST resource and then poll the GET resource until an end state is reached (success/failure).
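
From the consumer’s side, the polling step might look like the sketch below. The fetch argument is a hypothetical callable that performs the GET on the results URL and returns the decoded JSON body; the interval and attempt limit are arbitrary defaults chosen for illustration.

```python
import time
from typing import Callable

END_STATES = {"COMPLETE", "FAILED"}


def poll_results(
    fetch: Callable[[], dict],
    interval: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch until meta.status is an end state or attempts run out."""
    for attempt in range(max_attempts):
        body = fetch()
        if body["meta"]["status"] in END_STATES:
            return body
        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError("query did not reach an end state in time")
```

Passing fetch and sleep in as parameters keeps the loop independent of any HTTP client and makes it easy to exercise without real waiting.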

Storage

By virtue of providing a resource ID and pollable content, you will need to store the query itself, as well as the status of the process that the query started. I don’t think that storing queries/process statuses is a big deal, because there are plenty of good storage solutions out there for this purpose (e.g. Redis is well suited to this, and has client libraries for many programming languages). However, it is worth bearing in mind the lifespan of said queries.

This is highly dependent on what the queries are being used for: some are reasonable to keep as permanent (e.g. a user’s search preferences for a membership site), others are more appropriate as temporary (e.g. finding quotes for car insurance). If you’re going to make it temporary, then do your consumers a favour and provide an expiry date for the resource in the response payload (e.g. in the metadata), so they know not to continue requesting it after that period of time (and back to Redis: it allows you to automatically delete data after a specific date very easily).
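
As a rough illustration of temporary queries, here is an in-memory store that records an expiry timestamp and drops records once it has passed. The class name and the injectable clock are assumptions for this sketch; with Redis you would typically hand the expiry over to its key TTL instead of checking it yourself.

```python
import time
from typing import Callable, Optional


class ExpiringQueryStore:
    """Keeps queries for ttl_seconds, then treats them as gone."""

    def __init__(self, ttl_seconds: int, clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._records: dict[str, dict] = {}

    def put(self, query_id: str, query: dict) -> dict:
        """Store a query and return the record, including its expiry."""
        created = int(self._clock())
        record = {
            "created": created,
            "expires": created + self._ttl,
            "query": query,
        }
        self._records[query_id] = record
        return record

    def get(self, query_id: str) -> Optional[dict]:
        """Return the record, or None if it is unknown or has expired."""
        record = self._records.get(query_id)
        if record is None:
            return None
        if self._clock() >= record["expires"]:
            del self._records[query_id]
            return None
        return record
```

The created and expires values are Unix timestamps, so they can be copied straight into the response payload for the consumer to read.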

Furthermore, you can provide an explicit DELETE /queries/{query_id} endpoint for cleaning up these resources.

Query modifications

What if the client messed up their original POST and wants to try again…do they need to create a new query? Isn’t it wasteful to allow my storage to be polluted with useless queries?

PATCH and PUT are the existing HTTP methods for modifying resources. We can use them to change the query and trigger the background processes again using new query data. If you’re going to make the client use the same payload structure as they did for POST, then use PUT /queries/{query_id}. If you’re going to allow the client to pass a "diff" (e.g. fix their address in the payload but leave the rest the same), then use PATCH /queries/{query_id} and ensure that the schema describes a sensible way of applying said diff.
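
The difference between the two can be sketched as follows. For the "diff" I’m assuming JSON Merge Patch semantics (RFC 7396), which is only one of several reasonable choices; the function names are hypothetical.

```python
import copy


def replace_query(stored: dict, new_query: dict) -> dict:
    """PUT: the payload replaces the stored query wholesale."""
    return copy.deepcopy(new_query)


def merge_patch(target, patch):
    """PATCH: apply a JSON Merge Patch style diff to the stored query.

    Objects are merged recursively, a null (None) value removes a key,
    and any other value replaces what was there. The input is not mutated.
    """
    if not isinstance(patch, dict):
        return copy.deepcopy(patch)
    result = copy.deepcopy(target) if isinstance(target, dict) else {}
    for name, value in patch.items():
        if value is None:
            result.pop(name, None)
        else:
            result[name] = merge_patch(result.get(name), value)
    return result
```

With this, a client that mistyped their building number only needs to send {"address": {"building_number": 24}} in the PATCH, and the rest of the stored query is left alone.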

Note that I separated the search results (/queries/{query_id}/results) from the query resource itself (/queries/{query_id}). It doesn’t make much sense to PUT, PATCH or DELETE the actual results themselves, but it does make sense to do so on the query that generated said results. Be careful in how you name your resources, and make them easy for your clients to discover via the API and documentation. HATEOAS is your friend!

// GET /queries/536d3ece-f7a6-45d9-83f6-40c7003aafb1
// Content-Type: application/json
{
    "created": 1594927540,
    "expires": 1594930230,
    "query": {
        "car_make": "toyota",
        "max_age": 10,
        "address": {
            "postcode": "...",
            "building_number": 42,
            // ...
        },
        "proximity_miles": 5
    },
    "meta": {
        "links": [
            {
                "href": "/queries/536d3ece-f7a6-45d9-83f6-40c7003aafb1/results",
                "method": "GET",
                "rel": "results"
            },
            {
                "href": "/queries/536d3ece-f7a6-45d9-83f6-40c7003aafb1",
                "method": "PUT",
                "rel": "modify"
            },
            {
                "href": "/queries/536d3ece-f7a6-45d9-83f6-40c7003aafb1",
                "method": "DELETE",
                "rel": "delete"
            }
        ]
    }
}
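
On the client side, following these links rather than hard-coding URIs can be as simple as a lookup by rel. This is a small sketch that assumes the meta.links structure shown above; find_link is a hypothetical helper name.

```python
from typing import Optional


def find_link(body: dict, rel: str) -> Optional[dict]:
    """Return the first link in meta.links with the given rel, if any."""
    for link in body.get("meta", {}).get("links", []):
        if link.get("rel") == rel:
            return link
    return None
```

A consumer would then call find_link(response_body, "results") and use the returned href and method for the next request, so the server stays free to change its URI layout.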

Helpful resources