There has been a lot of discussion on the openstack-dev mailing list around the future of the Nova API. This document tries to cover the problems with the v2 API, what has been developed in the v3 API to address them, and how we can resolve some of the issues with long term maintenance of two APIs. It also covers strategies for coping with backwards incompatible changes in the future whilst still minimising maintenance overhead.
Note that this document does not attempt to compare the V2 API only proposal against proceeding with the V3 API.
The Nova v2 API is essentially the first version of the Nova API and has grown fairly organically over time. Nova has grown very quickly over a short period of time, and although in recent times we have started to raise the bar when it comes to reviewing changes to the API, historically adding features quickly has taken priority. Both the API itself, and especially how to add features to the API, have been inadequately documented. Often API features have been written by cutting and pasting existing code and modifying it to suit. As a result a lot of bugs and inconsistencies have arisen, depending on exactly which existing API feature a new one was based on.
Work first started in Grizzly to address some of these problems, but it was quickly realised that many of the problems could not be fixed without backwards incompatible changes. A larger range of fixes was proposed and discussed at a couple of the Havana design summit sessions, and patches started merging soon after. Some of the issues are:
Although strictly speaking changing error return codes is a backwards incompatible change, we have traditionally made such changes to the v2 API anyway. However, there are also cases in the v2 API where we return incorrect success codes. For example, there are cases where we return a 200 (OK) or 201 (CREATED) HTTP status code when on the backend this is actually an asynchronous call which may fail. We should instead be returning a 202 (ACCEPTED).
Currently clients who trust that we return correct success codes are most likely assuming that operations have succeeded when, under load, they will occasionally fail.
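The difference matters on the client side. The following is a minimal sketch, not taken from Nova or python-novaclient, of how a client might treat 202 differently from 200 and 201. The function names, the `poll` callable and the "ACTIVE"/"ERROR" state names are assumptions made for illustration only.

```python
from http import HTTPStatus


def interpret_status(status_code):
    """Classify a response status from the client's point of view.

    Returns "done" when the work is complete, "pending" when the request
    was only accepted and may still fail, and "error" otherwise.
    """
    if status_code in (HTTPStatus.OK, HTTPStatus.CREATED):
        return "done"
    if status_code == HTTPStatus.ACCEPTED:
        # 202 only promises that the request was accepted; the caller has
        # to check the resource later to learn the final outcome.
        return "pending"
    return "error"


def wait_until_finished(status_code, poll, max_polls=10):
    """Follow up a 202 by polling with the hypothetical ``poll`` callable.

    ``poll`` is assumed to return "ACTIVE", "ERROR" or an in-progress state.
    """
    outcome = interpret_status(status_code)
    if outcome != "pending":
        return outcome
    for _ in range(max_polls):
        state = poll()
        if state == "ACTIVE":
            return "done"
        if state == "ERROR":
            return "error"
    return "pending"
```

A client that receives 200 for an asynchronous call has no reason to enter the polling branch at all, which is how silent failures under load go unnoticed.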
User input is poorly validated, or not validated at all, at the API level. There are cases where parameters are not correctly checked at the API level, which leads to more complicated or incorrect error handling at a lower level. In other cases extraneous data is not correctly rejected, and so clients send data they think is being used but which is in fact being ignored.
This has not been helped by the API samples used for our documentation being incorrect, so we have explicitly misled users as to how to use our API. Although the API samples are run through the actual API code, the v2 API input validation has not picked up these problems, and we did not notice until we implemented strong input validation for the v3 API.
If we can't always get our API samples correct, which are often written by the people writing the API features, it's unlikely that people writing against our API are going to be significantly better at it. So we cannot arbitrarily tighten the input validation, as this would be a backwards incompatible change which would break existing applications.
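To make the problem concrete, here is a small sketch contrasting lenient and strict handling of a request body. It is not Nova code; the parameter names and the misspelled `avalability_zone` key are invented for illustration.

```python
def parse_lenient(body):
    """Mimic weak validation: pick out known keys, silently drop the rest."""
    known = ("name", "flavor_ref")
    return {key: body[key] for key in known if key in body}


def parse_strict(body):
    """Reject any key that the handler would otherwise ignore."""
    known = {"name", "flavor_ref"}
    unknown = sorted(set(body) - known)
    if unknown:
        raise ValueError("unexpected parameters: %s" % ", ".join(unknown))
    return dict(body)


# The client misspells a parameter it believes is being honoured.
request = {"name": "vm1", "flavor_ref": "1", "avalability_zone": "az1"}
```

With `parse_lenient` the misspelled key disappears without any feedback. With `parse_strict` the same request fails, which is the correct behaviour but also exactly why tightening validation on an existing API breaks applications that happen to work today.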
Exceptions from internal Nova code have not been consistently caught, leading to poor quality error messages being sent to clients. Some cases were found which were in fact quite misleading about the real nature of the error.
As the v2 API has grown very quickly without programmers and reviewers looking at the overall big picture of the Nova API or REST API design guidelines, it has become quite self-inconsistent. Simple things are inconsistent, such as the use of CamelCase or snake_case across the API (we even end up with both within the same API call), the way extensions are named, or even the naming of concepts within Nova. For example, in some areas we talk about instances and in others refer to the same thing as a server, or we use project and tenant interchangeably. As a result, for some of these issues we end up with prominent explanations to reduce user confusion. It's not an issue for those who have become familiar with Nova's quirks, but it is a barrier to new users.
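As an illustration of the naming problem, the sketch below normalises a request body with mixed key styles into snake_case. The helper names are hypothetical and the field names are only examples of the mixed style described above.

```python
import re

# Zero-width boundary between a lowercase letter or digit and an uppercase one.
_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake_case(name):
    """Convert a camelCase identifier to snake_case."""
    return _BOUNDARY.sub("_", name).lower()


def normalise_keys(value):
    """Recursively rewrite every dictionary key to snake_case."""
    if isinstance(value, dict):
        return {to_snake_case(k): normalise_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalise_keys(item) for item in value]
    return value


mixed = {"server": {"flavorRef": "1", "key_name": "mykey"}}
uniform = normalise_keys(mixed)
```

Keys that are already in snake_case pass through unchanged, so the helper can be applied to a body that mixes both styles.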
There are a number of issues with the v2 API implementation which lead to higher development and maintenance costs for the Nova API. Together with the primarily user facing issues this results in a code base which is a lot more fragile than we would otherwise like to have.
The v2 API plugin loader is a pretty basic, Nova specific, hand crafted implementation, which has led to compromises in the way that API features are added. There are now Python modules available which can do a lot of the hard work for us.
The way that the core parts of the v2 API are implemented is quite different to the way that API extensions are implemented. This results in a steeper learning curve and higher maintenance overhead for developers and reviewers, as there are two ways of doing the same thing.
Partly as a result of the way the v2 API plugin loader works there are also areas where there is a lack of clean separation of API features, which results in much more complicated code and higher maintenance costs for the API.
The v2 API features (extensions or core) are not versioned. This has led us to a quite ugly workaround where we have to create a new extension that users can look for whenever we want to make a backwards compatible change.
Although poor input validation is primarily a user facing issue, it is also a maintenance cost for the development community, because it makes it more likely that changes to the API layer, or indeed to layers further down, will accidentally change our API and break applications.
Development of the v3 API started with some prototype code for the new API plugin framework at the Havana summit. Most of the API code was ported and merged during Havana, with a focus on testing, API input validation and more generic cleanups in Icehouse. Most of the work agreed on at the Icehouse summit has been completed and submitted for review, though some of it has not yet merged.
The tempest tests have been adjusted for the cleaned up v3 API and the v3 API tempest tests are part of the gate that all changes have to pass.
python-novaclient support has been implemented and merged.
As a very rough measure of the amount of effort required to get to where we are, counting just changes in the Nova repository (so not tempest or python-novaclient development), there are over 400 V3 API related patches merged across Havana and Icehouse. This does not include the extra "part 1" patches which were a result of making it easier to review the initial V3 code, but it would include some unrelated bug fixes. There are approximately 30 patches which were in the review queue until they were frozen pending the Nova API discussions.
There is no XML support in the v3 API, as the v2 API support for XML has been marked deprecated. There is also no proxying of information from neutron, glance or cinder, which can instead be queried from those services directly. It is expected that client libraries will handle that instead. python-novaclient does this where necessary and leaves the rest to the various service clients. Eventually it is expected that the openstack client will handle it as a unified interface, cutting back further on duplicated code. Retaining proxying support in the Nova API in the very long term means duplicated code between the Nova API and openstack clients, which has extra maintenance overhead.
There were two reasons behind delaying the release of the v3 API in Icehouse. The first was that development of nova-network was unfrozen early in I-3. Originally nova-network was deprecated and so nova-network support was explicitly being removed from the v3 API. As a result of nova-network now becoming required, the nova-network related code will also have to be ported to the v3 API.
The second reason was that the new tasks API work was not completed in time. As discussed at the mid cycle meetup, we did not want to compromise on the design of the tasks API, and it requires non backwards compatible changes to several API areas. To reduce the risk of getting these very last minute API changes wrong and having to live with those bugs for a very long time, it was decided to defer marking the v3 API as supported rather than experimental.
Strong input validation has been added using JSON schema. The v3 API is very strict about rejecting malformed data, whether it be part of the expected data or extraneous data. It also has some nice potential side-effect improvements.
We now correctly return 202 for asynchronous calls where we don't know if the operation requested will succeed or not.
The v3 API has made the backwards incompatible changes needed to remove weird non-REST-like design quirks of the API, and there is now a uniform naming format and set of concepts across the API. For example, extensions are named consistently, we don't use different names for the same concepts, and snake_case is used uniformly for data parameters received and returned.
For example the v2 API version of a detailed server list looks like:
v2 API | v3 API |
---|---|
(example not preserved) | (example not preserved) |
The number of extensions blew out in the v2 API because of the lack of versioning. The v3 API has a lot fewer extensions, as functionality is merged into the relevant basic extensions. With versioning this should not recur.
There is no longer any difference between how API features are implemented, whether they are core or extensions. The framework also allows for better isolation of the code for different parts of the API, and it is no longer necessary to modify parts of the core API to support extensions. This reduces the maintenance overhead for the API code. For example, compare the v2 and v3 API versions of the servers create method (note that although the input validation patch for this plugin has been written, it has not merged in this code and so there is still some input validation in the v3 version).
v2 API | v3 API |
---|---|
(code sample not preserved) | (code sample not preserved) |
The JSON Schema input validation abstracts the validation of client supplied input from where it is used. So for example, the JSON schema for the evacuate action looks like:
evacuate = {
    'type': 'object',
    'properties': {
        'evacuate': {
            'type': 'object',
            'properties': {
                'host': parameter_types.hostname,
                'on_shared_storage': parameter_types.boolean,
                'admin_password': parameter_types.admin_password,
            },
            'required': ['host', 'on_shared_storage'],
            'additionalProperties': False,
        },
    },
    'required': ['evacuate'],
    'additionalProperties': False,
}
It clearly defines which parameters are valid, the format of those parameters, which are required and which are optional, and whether additional parameters can be supplied. Doing input validation this way also encourages more consistent input validation through shared parameter types rather than API extension specific local regular expressions. Because the validation occurs through a decorator, the part of the API which assembles the client supplied data becomes much simpler:
v2 API | v3 API |
---|---|
(code sample not preserved) | (code sample not preserved) |
Not only is the code simpler for v3, but the input validation is much better compared to v2. And it is significantly easier to review both the API logic and the input validation when they are separated in this way.
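The separation of validation from API logic can be sketched as follows. This is not the Nova implementation: the `validated` decorator, the `check` function and the inlined parameter types are assumptions, and `check` understands only a tiny subset of JSON Schema so that the example stays self-contained.

```python
import functools


class ValidationError(Exception):
    """Raised when a request body does not match its schema."""


def check(schema, data, path="body"):
    """Check ``data`` against a small subset of JSON Schema.

    Only 'type', 'properties', 'required' and 'additionalProperties' are
    understood; a real service would use a complete validator.
    """
    types = {"object": dict, "string": str, "boolean": bool}
    expected = types[schema["type"]]
    if not isinstance(data, expected):
        raise ValidationError("%s must be of type %s" % (path, schema["type"]))
    if expected is not dict:
        return
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in data:
            raise ValidationError("%s.%s is required" % (path, name))
    for name, value in data.items():
        if name in properties:
            check(properties[name], value, "%s.%s" % (path, name))
        elif schema.get("additionalProperties", True) is False:
            raise ValidationError("%s.%s is not allowed" % (path, name))


def validated(schema):
    """Decorator that validates ``body`` before the handler runs."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(body):
            check(schema, body)
            return handler(body)
        return wrapper
    return decorator


# Same shape as the evacuate schema above, with simplified parameter types.
EVACUATE = {
    "type": "object",
    "properties": {
        "evacuate": {
            "type": "object",
            "properties": {
                "host": {"type": "string"},
                "on_shared_storage": {"type": "boolean"},
                "admin_password": {"type": "string"},
            },
            "required": ["host", "on_shared_storage"],
            "additionalProperties": False,
        },
    },
    "required": ["evacuate"],
    "additionalProperties": False,
}


@validated(EVACUATE)
def evacuate(body):
    # No defensive checks needed here: the body is already known to be valid.
    params = body["evacuate"]
    return params["host"], params["on_shared_storage"]
```

The handler body contains only API logic, so a reviewer can read the schema and the handler independently.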
What error status codes may be returned from a method is explicitly specified via a decorator. This helps address a few problems. The first is that we can now automate the documentation process of what error codes may be returned which will make the documentation more reliable. We also pick up much earlier in the development process Nova exceptions which are unhandled. And I believe the requirement for these to be explicitly specified helps remind both developers and reviewers to think about potential error paths rather than just the normal success paths.
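A decorator of this kind might look like the sketch below. The names `expected_errors` and `HTTPError`, and the choice to turn undeclared errors into a 500, are assumptions for illustration and are not copied from the Nova source.

```python
import functools


class HTTPError(Exception):
    """Minimal stand-in for a web framework's HTTP exception."""

    def __init__(self, status, message=""):
        super().__init__(message)
        self.status = status


def expected_errors(*statuses):
    """Declare which error status codes a handler may legitimately return."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            try:
                return handler(*args, **kwargs)
            except HTTPError as error:
                if error.status in statuses:
                    raise
                # An undeclared status means an undocumented error path.
                raise HTTPError(500, "unexpected error status %d" % error.status)
            except Exception:
                # Internal exceptions that nobody handled surface here.
                raise HTTPError(500, "unhandled internal exception")
        # Keep the declaration on the function so that documentation
        # tooling can read it without calling the handler.
        wrapper.expected_errors = statuses
        return wrapper
    return decorator


@expected_errors(404, 409)
def unrescue(server_id):
    if server_id != "known":
        raise HTTPError(404, "no such server")
    return 202
```

Because the declared codes are stored as an attribute, a documentation generator could collect them by inspecting the handlers rather than by reading the code by hand.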
Explicit control over which plugins are loaded gives deployers a bit of extra insurance that their configuration doesn't accidentally change underneath them when updating Nova. Previously, if the plugin loader found anything that looked like a plugin it just loaded it. There is now more explicit control available over what gets loaded and what doesn't.
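One way such control could work is a whitelist and blacklist applied to the discovered plugins. The sketch below assumes that design; the function name, its parameters and the plugin names are hypothetical and do not describe Nova's actual configuration options.

```python
def select_plugins(discovered, whitelist=(), blacklist=()):
    """Return the plugin names that should be loaded, in discovery order.

    An empty whitelist means everything discovered is allowed; the
    blacklist always takes precedence over the whitelist.
    """
    allowed = set(whitelist) if whitelist else set(discovered)
    blocked = set(blacklist)
    return [name for name in discovered
            if name in allowed and name not in blocked]


found = ["servers", "rescue", "evacuate", "new_plugin"]
loaded = select_plugins(found, whitelist=["servers", "rescue", "evacuate"])
```

With a whitelist in place, a plugin that appears after an upgrade (`new_plugin` here) is not loaded until the deployer opts in.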
Part of the v3 API work has been moving policy checks, which have often been done at the db or compute API level, to the REST API. This has the benefit of catching permission errors as soon as possible, reducing the amount of unwinding of work that needs to be done when policy checks are done at the low level. Not only does it remove ambiguity around what a policy change actually means for people accessing the API, it also fixes situations where changing policies for the API is not able to have full effect because of secondary hard-coded admin or policy checks at a lower level.
At the request of some deployers, in some cases extensions were split into multiple extensions to allow them to more finely define which API features they want to use.
As the input validation and error handling in the v3 API have been improved, unit test coverage for these cases has also been added to ensure we don't accidentally regress.
One of the issues arising from creating the v3 API, with the v2 API needing to be supported for much longer than the standard one cycle deprecation period, is the dual maintenance cost. That is, when Nova internal APIs change it may in some cases be necessary to make the corresponding changes in the v2 API code, possibly the EC2 API code, and now the v3 API code as well.
The recent objects work is an example where this has occurred. The API layer is in most cases a very thin layer on top of Nova internals and follows the general layout of:
There are cases where it is more complex than this, but they all follow the general process. So an example of one recent objects patch which required changing both the v2 and v3 API code is converting unrescue to support objects. The diffstat for the patch is:
diff ec78b42d7b7e9da99ba063cccc8a4f6d0aa7c8e5^..ec78b42d7b7e9da99ba063cccc8a4f6d0aa7c8e5 | diffstat
api/openstack/compute/contrib/rescue.py | 2 +-
api/openstack/compute/plugins/v3/rescue.py | 3 ++-
compute/manager.py | 14 +++++++-------
compute/rpcapi.py | 12 ++++++++----
tests/compute/test_compute.py | 8 +++++---
tests/compute/test_rpcapi.py | 2 +-
6 files changed, 24 insertions(+), 17 deletions(-)
So changes were required to both the v2 and v3 versions of rescue plugin. The patches to those parts look like:
diff --git a/nova/api/openstack/compute/contrib/rescue.py b/nova/api/openstack/compute/con
index fe31f2c..0233be2 100644
--- a/nova/api/openstack/compute/contrib/rescue.py
+++ b/nova/api/openstack/compute/contrib/rescue.py
@@ -75,7 +75,7 @@ class RescueController(wsgi.Controller):
         """Unrescue an instance."""
         context = req.environ["nova.context"]
         authorize(context)
-        instance = self._get_instance(context, id)
+        instance = self._get_instance(context, id, want_objects=True)
         try:
             self.compute_api.unrescue(context, instance)
         except exception.InstanceInvalidState as state_error:
diff --git a/nova/api/openstack/compute/plugins/v3/rescue.py b/nova/api/openstack/compute/
index 5ae876b..66b4c17 100644
--- a/nova/api/openstack/compute/plugins/v3/rescue.py
+++ b/nova/api/openstack/compute/plugins/v3/rescue.py
@@ -77,7 +77,8 @@ class RescueController(wsgi.Controller):
         """Unrescue an instance."""
         context = req.environ["nova.context"]
         authorize(context)
-        instance = common.get_instance(self.compute_api, context, id)
+        instance = common.get_instance(self.compute_api, context, id,
+                                       want_objects=True)
         try:
             self.compute_api.unrescue(context, instance)
         except exception.InstanceInvalidState as state_error:
The extra dual maintenance burden ends up being an additional one line trivial change amongst a larger patch to change the infrastructure underneath, though sometimes there are corresponding changes to tests as well. A lot of the changes for objects have been similar, but some are more complicated. In those cases we can remove the dual maintenance burden for Nova internal API changes by refactoring so that the v2 and v3 API code call into a common method.
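A sketch of that refactoring is shown below. `FakeComputeAPI`, `unrescue_common` and the controller classes are hypothetical stand-ins, and the response keys are invented; the point is only that the call into Nova internals lives in one place.

```python
class FakeComputeAPI:
    """Stand-in for Nova's internal compute API (hypothetical)."""

    def get(self, context, server_id, want_objects=False):
        return {"id": server_id, "is_object": want_objects, "rescued": True}

    def unrescue(self, context, instance):
        instance["rescued"] = False


def unrescue_common(compute_api, context, server_id):
    """Shared logic: an internal API change is made here exactly once."""
    instance = compute_api.get(context, server_id, want_objects=True)
    compute_api.unrescue(context, instance)
    return instance


class V2RescueController:
    def __init__(self, compute_api):
        self.compute_api = compute_api

    def unrescue(self, context, server_id):
        instance = unrescue_common(self.compute_api, context, server_id)
        # Only the version specific response formatting stays here.
        return {"serverId": instance["id"]}


class V3RescueController:
    def __init__(self, compute_api):
        self.compute_api = compute_api

    def unrescue(self, context, server_id):
        instance = unrescue_common(self.compute_api, context, server_id)
        return {"server_id": instance["id"]}
```

Had the `want_objects=True` change from the patch above been needed in this layout, it would have touched `unrescue_common` only.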
In order to reduce the dual maintenance costs in the long term and reduce LOC in Nova, one approach would be to implement the v2 API on top of the v3 codebase, primarily as decorators. This is much easier to achieve if we allow this implementation to have strong input validation, as a lot of the translation can be done using just JSON schema. See a simple proof of concept patch here https://review.openstack.org/#/c/77105/. Essentially we'd have a v2.1 which is the same as v2 except for strong input validation.
This technique allows us to eventually have one code base for v2 and v3 and at the same time preserve backwards compatibility for v2, as we can translate on both the input and the output. It has the significant maintenance advantage of keeping the v2 and v3 API input and output handling separate from each other. Validation testing for this API is fairly straightforward, as we have existing tempest and unit tests for the v2 API and we are only concerned about verifying that behaviour for correct input remains the same.
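The idea of translating on both input and output can be sketched with a decorator. This is not the proof of concept patch referenced above; the key mapping, the `v2_compatible` decorator and the handler are assumptions chosen to keep the example short.

```python
import functools

# Illustrative mapping between v2 style and v3 style parameter names.
V2_TO_V3 = {"adminPass": "admin_password", "onSharedStorage": "on_shared_storage"}
V3_TO_V2 = {v3: v2 for v2, v3 in V2_TO_V3.items()}


def rename_keys(value, mapping):
    """Recursively rename dictionary keys according to ``mapping``."""
    if isinstance(value, dict):
        return {mapping.get(k, k): rename_keys(v, mapping)
                for k, v in value.items()}
    if isinstance(value, list):
        return [rename_keys(item, mapping) for item in value]
    return value


def v2_compatible(handler):
    """Expose a v3 handler through the v2 wire format."""
    @functools.wraps(handler)
    def wrapper(body):
        v3_response = handler(rename_keys(body, V2_TO_V3))
        return rename_keys(v3_response, V3_TO_V2)
    return wrapper


def evacuate_v3(body):
    # The canonical implementation only ever sees v3 names.
    params = body["evacuate"]
    return {"admin_password": params.get("admin_password", "generated")}


evacuate_v2 = v2_compatible(evacuate_v3)
```

The canonical handler stays free of version checks; everything specific to v2 is confined to the mapping and the decorator.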
It also provides for a good transition strategy from v2 to v2.1. As the original v2 API code remains untouched we do not risk accidentally changing the semantics of the v2 API code. The only applications which could break in the transition from v2 to v2.1 would be those which are currently misusing the API. The client applications would be able to have a reasonable time period where they can quite easily test/verify against the v2.1 API before v2 is deprecated. And there is quite a strong incentive for doing so because misusing the API is a sign that they may have a bug in their program.
This would involve implementing a separate service which essentially translates v2 REST API requests into v3 ones, ensuring that only valid requests are passed to the v2 API, and does the inverse when returning data to the caller. It would also have to implement proxying to neutron/cinder/glance where appropriate. Once this is proven as stable, like option 1, the legacy v2 API code could be removed. This would be one way to retain a v2 API with poor input validation but reduce maintenance costs.
We need to be a lot more careful about API changes in the future and not just consider a change in isolation, but also its overall impact on the Nova API. However, I think we have to accept that we will still make mistakes and need a strategy for how we handle that.
Whether we use version headers or url path differences, either is a major version rev, and I think one of the most important priorities should be to keep our code base clean in the long term and not end up with a growing number of interleaved version tests in the API code. Wherever possible I think we should aim to keep the canonical latest version of the API code clean and separate from code needed to keep legacy API versions supported, as this lowers our long term maintenance costs.
Last edited 2014/03/04 05:52:43 UTC