
What’s New in SQLAlchemy 1.4?

About this Document

This document describes changes between SQLAlchemy version 1.3 and SQLAlchemy version 1.4.

Version 1.4 takes on a different focus than other SQLAlchemy releases in that it is in many ways attempting to serve as a potential migration point for a more dramatic series of API changes currently planned for release 2.0 of SQLAlchemy. The focus of SQLAlchemy 2.0 is a modernized and slimmed down API that removes lots of usage patterns that have long been discouraged, and that mainstreams the best ideas in SQLAlchemy as first class API features. The goal is that there is much less ambiguity in how the API is to be used, and that a series of implicit behaviors and rarely-used API flags that complicate the internals and hinder performance will be removed.

Behavioral Changes - General

Many Core and ORM statement objects now perform much of their validation in the compile phase

A major initiative in the 1.4 series is to rework the model of both Core SQL statements and the ORM Query to allow for an efficient, cacheable model of statement creation and compilation, where the compilation step is cached based on a cache key generated by the statement object, which is itself newly created for each use. Towards this goal, much of the Python computation which occurs within the construction of statements, particularly that of the ORM Query, is being moved to occur only when the statement is invoked. This means that some of the error messages which can arise based on arguments passed to the object will no longer be raised immediately, and will instead occur only when the statement is invoked.

Error conditions which fall under this category include:

  • when a CompoundSelect is constructed (e.g. a UNION, EXCEPT, etc.) and the SELECT statements passed do not have the same number of columns, a CompileError is now raised to this effect; previously, an ArgumentError would be raised immediately upon statement construction. A sketch of the new timing appears after this list.

  • To be continued…
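
As a brief illustration of the deferred validation, the following sketch (using ad-hoc column() constructs purely for the purpose of the example) builds a UNION whose SELECTs have mismatched column counts; construction succeeds, and the CompileError is raised only once the statement is compiled:

from sqlalchemy import column, select, union

# construction no longer raises; in 1.3 this failed immediately
stmt = union(
    select([column('a'), column('b')]),
    select([column('c')])
)

# the CompileError is raised here, when the statement is compiled
print(stmt)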

API Changes - Core

A SELECT statement is no longer implicitly considered to be a FROM clause

This change is one of the larger conceptual changes in SQLAlchemy in many years, however it is hoped that the end user impact is relatively small, as the change more closely matches what databases like MySQL and PostgreSQL require in any case.

The most immediately noticeable impact is that a select() can no longer be embedded inside of another select() directly, without explicitly turning the inner select() into a subquery first. This has historically been performed by using the SelectBase.alias() method, which remains; however the operation is now more explicitly suited by the new method SelectBase.subquery(); both methods do the same thing. The object returned is now Subquery, which is very similar to the Alias object and shares a common base AliasedReturnsRows.

That is, this will now raise:

stmt1 = select([user.c.id, user.c.name])
stmt2 = select([addresses, stmt1]).select_from(addresses.join(stmt1))

Raising:

sqlalchemy.exc.ArgumentError: Column expression or FROM clause expected,
got <...Select object ...>. To create a FROM clause from a <class
'sqlalchemy.sql.selectable.Select'> object, use the .subquery() method.

The correct calling form is instead:

sq1 = select([user.c.id, user.c.name]).subquery()
stmt2 = select([addresses, sq1]).select_from(addresses.join(sq1))

Note above that the SelectBase.subquery() method is essentially equivalent to using the SelectBase.alias() method.

The above calling form is typically required in any case as the call to SelectBase.subquery() or SelectBase.alias() is needed to ensure the subquery has a name. The MySQL and PostgreSQL databases do not accept unnamed subqueries in the FROM clause and they are of limited use on other platforms; this is described further below.

Along with the above change, the general capability of select() and related constructs to create unnamed subqueries (meaning a FROM subquery that renders without any “AS somename” phrase) has been removed, and the ability of the select() construct to implicitly create subqueries without explicit calling code to do so is mostly deprecated. In the above example, as has always been the case, using the SelectBase.alias() method as well as the new SelectBase.subquery() method without passing a name will generate a so-called “anonymous” name, which is the familiar anon_1 name we see in SQLAlchemy queries:

SELECT
    addresses.id, addresses.email, addresses.user_id,
    anon_1.id, anon_1.name
FROM
    addresses JOIN
    (SELECT users.id AS id, users.name AS name FROM users) AS anon_1
    ON addresses.user_id = anon_1.id

Unnamed subqueries in the FROM clause (which, note, are different from so-called “scalar subqueries”, which take the place of a column expression in the columns clause or WHERE clause) are of extremely limited use in SQL, and their production in SQLAlchemy has mostly presented itself as an undesirable behavior that needs to be worked around. For example, both MySQL and PostgreSQL outright reject the usage of unnamed subqueries:

# MySQL / MariaDB:

MariaDB [(none)]> select * from (select 1);
ERROR 1248 (42000): Every derived table must have its own alias


# PostgreSQL:

test=> select * from (select 1);
ERROR:  subquery in FROM must have an alias
LINE 1: select * from (select 1);
                      ^
HINT:  For example, FROM (SELECT ...) [AS] foo.

A database like SQLite accepts them, however it is still often the case that the names produced from such a subquery are too ambiguous to be useful:

sqlite> CREATE TABLE a(id integer);
sqlite> CREATE TABLE b(id integer);
sqlite> SELECT * FROM a JOIN (SELECT * FROM b) ON a.id=id;
Error: ambiguous column name: id
sqlite> SELECT * FROM a JOIN (SELECT * FROM b) ON a.id=b.id;
Error: no such column: b.id

# use a name
sqlite> SELECT * FROM a JOIN (SELECT * FROM b) AS anon_1 ON a.id=anon_1.id;

Due to the above limitations, there are very few places in SQLAlchemy where such a query form was valid; the one exception was within the Oracle dialect, where unnamed subqueries were used to create OFFSET / LIMIT subqueries, as Oracle does not support these keywords directly; this implementation has been replaced by one which uses anonymous subqueries. Throughout the ORM, exception cases that detect where a SELECT statement would be SELECTed from either encourage the user to create, or implicitly create on the user’s behalf, an anonymously named subquery; it is hoped that by moving to all-explicit subqueries, much of the complexity incurred by these areas can be removed.

As SelectBase objects are no longer FromClause objects, attributes like the .c attribute as well as methods like .select(), .join(), and .outerjoin() upon SelectBase are now deprecated, as these methods all imply implicit production of a subquery. Instead, as is already what the vast majority of applications have to do in any case, invoking SelectBase.alias() or SelectBase.subquery() will provide for a Subquery object that provides all these attributes, as it is part of the FromClause hierarchy. In the interim, these methods are still available, however they now produce an anonymously named subquery rather than an unnamed one, and this subquery is distinct from the SelectBase construct itself.
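
For example, code that previously accessed columns on the statement itself can do so through an explicit subquery; a minimal sketch, reusing the user table from the earlier examples:

stmt = select([user.c.id, user.c.name])
subq = stmt.subquery()

# the Subquery is a FromClause; .c, .select(), .join() are available here
print(select([subq.c.name]).where(subq.c.id > 5))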

In place of the .c attribute, a new attribute SelectBase.selected_columns is added. This attribute resolves to a column collection that is what most people hope that .c does (but does not), which is to reference the columns that are in the columns clause of the SELECT statement. A common beginner mistake is code such as the following:

stmt = select([users])
stmt = stmt.where(stmt.c.name == 'foo')

The above code appears intuitive, as though it would generate “SELECT * FROM users WHERE name='foo'”; however veteran SQLAlchemy users will recognize that it in fact generates a useless subquery resembling “SELECT * FROM (SELECT * FROM users) WHERE name='foo'”.

The new SelectBase.selected_columns attribute however does suit the use case above, as in a case like the above it links directly to the columns present in the users.c collection:

stmt = select([users])
stmt = stmt.where(stmt.selected_columns.name == 'foo')

There is of course the notion that perhaps .c on SelectBase could simply act the way SelectBase.selected_columns does above; however, in light of the fact that .c is strongly associated with the FromClause hierarchy, meaning that it is a set of columns that can be placed directly in the FROM clause of another SELECT, it’s better that a column collection which serves an entirely different purpose have a new name.

In the bigger picture, the reason this change is being made now is towards the goal of unifying the ORM Query object into the SelectBase hierarchy in SQLAlchemy 2.0, so that the ORM will have a “select()” construct that extends directly from the existing select() object, having the same methods and behaviors except that it will have additional ORM functionality. All statement objects in Core will also be fully cacheable using a new system that resembles “baked queries” except that it will work transparently for all statements across Core and ORM. In order to achieve this, the Core class hierarchy needs to be refined to behave in such a way that is more easily compatible with the ORM, and the ORM class hierarchy needs to be refined so that it is more compatible with Core.

#4617

All IN expressions render parameters for each value in the list on the fly (e.g. expanding parameters)

The “expanding IN” feature, first introduced in Late-expanded IN parameter sets allow IN expressions with cached statements, has matured enough such that it is clearly superior to the previous method of rendering IN expressions. As the approach was improved to handle empty lists of values, it is now the only means that Core / ORM will use to render lists of IN parameters.

The previous approach, which has been present in SQLAlchemy since its first release, was that when a list of values was passed to the ColumnOperators.in_() method, the list would be expanded into a series of individual BindParameter objects at statement construction time. This suffered from the limitation that it was not possible to vary the parameter list at statement execution time based on the parameter dictionary, which meant that string SQL statements could not be cached independently of their parameters, nor could the parameter dictionary be fully used for statements that included IN expressions generally.

In order to service the “baked query” feature described at Baked Queries, a cacheable version of IN was needed, which is what brought about the “expanding IN” feature. In contrast to the existing behavior whereby the parameter list is expanded at statement construction time into individual BindParameter objects, the feature instead uses a single BindParameter that stores the list of values at once; when the statement is executed by the Engine, it is “expanded” on the fly into individual bound parameter positions based on the parameters passed to the call to Connection.execute(), and the existing SQL string which may have been retrieved from a previous execution is modified using a regular expression to suit the current parameter set. This allows for the same Compiled object, which stores the rendered string statement, to be invoked multiple times against different parameter sets that modify the list contents passed to IN expressions, while still maintaining the behavior of individual scalar parameters being passed to the DBAPI. While some DBAPIs do support this functionality directly, it is not generally available; the “expanding IN” feature now supports the behavior consistently for all backends.
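
The execution-time expansion may also be made explicit with an “expanding” bindparam(), which allows the list to vary per execution; a sketch, assuming a Core table a with an id column and an existing engine:

from sqlalchemy import bindparam, select

stmt = select([a.c.id]).where(a.c.id.in_(bindparam("ids", expanding=True)))

with engine.connect() as conn:
    # the same statement, and its cached SQL string, services
    # lists of any length
    conn.execute(stmt, {"ids": [1, 2, 3]}).fetchall()
    conn.execute(stmt, {"ids": [7]}).fetchall()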

As a major focus of 1.4 is to allow for true statement caching in Core and ORM without the awkwardness of the “baked” system, and since the “expanding IN” feature represents a simpler approach to building expressions in any case, it’s now invoked automatically whenever a list of values is passed to an IN expression:

stmt = select([A.id, A.data]).where(A.id.in_([1, 2, 3]))

The pre-execution string representation is:

>>> print(stmt)
SELECT a.id, a.data
FROM a
WHERE a.id IN ([POSTCOMPILE_id_1])

To render the values directly, use literal_binds as was the case previously:

>>> print(stmt.compile(compile_kwargs={"literal_binds": True}))
SELECT a.id, a.data
FROM a
WHERE a.id IN (1, 2, 3)

A new flag, “render_postcompile”, is added as a helper to allow the current bound value to be rendered as it would be passed to the database:

>>> print(stmt.compile(compile_kwargs={"render_postcompile": True}))
SELECT a.id, a.data
FROM a
WHERE a.id IN (:id_1_1, :id_1_2, :id_1_3)

Engine logging output shows the ultimate rendered statement as well:

INFO sqlalchemy.engine.base.Engine SELECT a.id, a.data
FROM a
WHERE a.id IN (?, ?, ?)
INFO sqlalchemy.engine.base.Engine (1, 2, 3)

As part of this change, the behavior of “empty IN” expressions, where the list parameter is empty, is now standardized on the use of the IN operator against a so-called “empty set”. As there is no standard SQL syntax for empty sets, a SELECT that returns no rows is used, tailored in specific ways for each backend so that the database treats it as an empty set; this feature was first introduced in version 1.3 and is described at Expanding IN feature now supports empty lists. The create_engine.empty_in_strategy parameter, introduced in version 1.2 as a means for migrating how this case was treated under the previous IN system, is now deprecated and no longer has an effect. As described in The IN / NOT IN operator’s empty collection behavior is now configurable; default expression simplified, this flag allowed a dialect to switch between the original system of comparing a column against itself, which turned out to be a huge performance issue, and a newer system of comparing “1 != 1” in order to produce a “false” expression. The behavior introduced in 1.3, which now takes place in all cases, is more correct than both approaches, as the IN operator is still used and the performance issue of the original system is avoided.

In addition, the “expanding” parameter system has been generalized so that it also services other dialect-specific use cases where a parameter cannot be accommodated by the DBAPI or backing database; see New “post compile” bound parameters used for LIMIT/OFFSET in Oracle, SQL Server for details.

#4645

New Features - ORM

Raiseload for Columns

The “raiseload” feature, which raises InvalidRequestError when an unloaded attribute is accessed, is now available for column-oriented attributes using the orm.defer.raiseload parameter of defer(). This works in the same manner as that of the raiseload() option used by relationship loading:

book = session.query(Book).options(defer(Book.summary, raiseload=True)).first()

# would raise an exception
book.summary

To configure column-level raiseload on a mapping, the deferred.raiseload parameter of deferred() may be used. The undefer() option may then be used at query time to eagerly load the attribute:

class Book(Base):
    __tablename__ = 'book'

    book_id = Column(Integer, primary_key=True)
    title = Column(String(200), nullable=False)
    summary = deferred(Column(String(2000)), raiseload=True)
    excerpt = deferred(Column(Text), raiseload=True)

book_w_excerpt = session.query(Book).options(undefer(Book.excerpt)).first()

It was originally considered that the existing raiseload() option that works for relationship() attributes be expanded to also support column-oriented attributes. However, this would break the “wildcard” behavior of raiseload(), which is documented as allowing one to prevent all relationships from loading:

session.query(Order).options(
    joinedload(Order.items), raiseload('*'))

Above, if we had expanded raiseload() to accommodate columns as well, the wildcard would also prevent columns from loading and would thus be a backwards incompatible change; additionally, if raiseload() covered both column expressions and relationships, it’s not clear how one would achieve the effect above of only blocking relationship loads without new API being added. So to keep things simple, the option for columns remains on defer():

raiseload() - query option to raise for relationship loads

orm.defer.raiseload - query option to raise for column expression loads

As part of this change, the behavior of “deferred” in conjunction with attribute expiration has changed. Previously, when an object would be marked as expired, and then unexpired via the access of one of the expired attributes, attributes which were mapped as “deferred” at the mapper level would also load. This has been changed such that an attribute that is deferred in the mapping will never “unexpire”; it only loads when accessed as part of the deferral loader.

An attribute that is not mapped as “deferred”, however was deferred at query time via the defer() option, will be reset when the object or attribute is expired; that is, the deferred option is removed. This is the same behavior as was present previously.
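
The distinction may be sketched as follows, assuming a hypothetical Document mapping where body is deferred at the mapper level via deferred() and title is an ordinary column:

from sqlalchemy import inspect
from sqlalchemy.orm import defer

doc = session.query(Document).options(defer(Document.title)).first()
session.expire(doc)

doc.id    # unexpiring loads ordinary attributes, including "title",
          # since the query-time defer() was reset by the expiration

"body" in inspect(doc).dict   # False; the mapper-level deferral never "unexpires"
doc.body                      # loads only when accessed directly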

#4826

Behavioral Changes - ORM

Session does not immediately create a new SessionTransaction object

The Session object’s default behavior of autocommit=False historically has meant that there is always a SessionTransaction object in play, associated with the Session via the Session.transaction attribute. When the given SessionTransaction was complete, due to a commit, rollback, or close, it was immediately replaced with a new one. The SessionTransaction by itself does not imply the usage of any connection-oriented resources, so this long-standing behavior has a particular elegance to it in that the state of Session.transaction is always predictable as non-None.

However, as part of the initiative in #5056 to greatly reduce reference cycles, this assumption means that calling upon Session.close() results in a Session object that still has reference cycles and is more expensive to clean up. In addition, there is a small overhead in constructing the SessionTransaction object, meaning that unnecessary overhead would be created for a Session that, for example, invoked Session.commit() and then Session.close().

As such, it was decided that Session.close() should leave the internal state of self.transaction, now referred to internally as self._transaction, as None, and that a new SessionTransaction should only be created when needed. For consistency and code coverage, this behavior was also expanded to include all the points at which “autobegin” is expected, not just when Session.close() was called.

In particular, this causes a behavioral change for applications which subscribe to the SessionEvents.after_transaction_create() event hook; previously, this event would be emitted when the Session was first constructed, as well as for most actions that closed the previous transaction and would emit SessionEvents.after_transaction_end(). The new behavior is that SessionEvents.after_transaction_create() is emitted on demand: when the Session has not yet created a new SessionTransaction object and mapped objects are associated with the Session through methods like Session.add() and Session.delete(), when the Session.transaction attribute is called upon, when the Session.flush() method has tasks to complete, etc.

Besides the change in when the SessionEvents.after_transaction_create() event is emitted, the change should have no other user-visible impact on the Session object’s behavior; the Session will continue to have the behavior that it remains usable for new operations after Session.close() is called, and the sequencing of how the Session interacts with the Engine and the database itself should also remain unaffected, since these operations were already operating in an on-demand fashion.
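
The new timing can be observed with a listener; a sketch, assuming an existing engine and a mapped object obj:

from sqlalchemy import event
from sqlalchemy.orm import Session

session = Session(engine)   # no SessionTransaction is created here anymore

@event.listens_for(session, "after_transaction_create")
def on_create(session, transaction):
    print("SessionTransaction created")

session.add(obj)   # "autobegin": the event is now emitted here, on demand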

#5074

Eager loaders emit during unexpire operations

A long sought behavior was that when an expired object is accessed, configured eager loaders will run in order to eagerly load relationships on the expired object when the object is refreshed or otherwise unexpired. This behavior has now been added, so that joined loaders will add inline JOINs as usual, and selectin/subquery loaders will run an “immediateload” operation for a given relationship, when an expired object is unexpired or an object is refreshed:

>>> a1 = session.query(A).options(joinedload(A.bs)).first()
>>> a1.data = 'new data'
>>> session.commit()

Above, the A object was loaded with a joinedload() option associated with it in order to eagerly load the bs collection. After the session.commit(), the state of the object is expired. Upon accessing the .data column attribute, the object is refreshed and this will now include the joinedload operation as well:

>>> a1.data
SELECT a.id AS a_id, a.data AS a_data, b_1.id AS b_1_id, b_1.a_id AS b_1_a_id
FROM a LEFT OUTER JOIN b AS b_1 ON a.id = b_1.a_id
WHERE a.id = ?

The behavior applies both to loader strategies applied to the relationship() directly, as well as with options used with Query.options(), provided that the object was originally loaded by that query.

For the “secondary” eager loaders “selectinload” and “subqueryload”, the SQL strategy for these loaders is not necessary in order to eagerly load attributes on a single object; so they will instead invoke the “immediateload” strategy in a refresh scenario, which resembles the query emitted by “lazyload”, emitted as an additional query:

>>> a1 = session.query(A).options(selectinload(A.bs)).first()
>>> a1.data = 'new data'
>>> session.commit()
>>> a1.data
SELECT a.id AS a_id, a.data AS a_data
FROM a
WHERE a.id = ?
(1,)
SELECT b.id AS b_id, b.a_id AS b_a_id
FROM b
WHERE ? = b.a_id
(1,)

Note that a loader option does not apply to an object that was introduced into the Session in a different way. That is, if the a1 object were just persisted in this Session, or was loaded with a different query before the eager option had been applied, then the object doesn’t have an eager load option associated with it. This is not a new concept, however users who are looking for the eagerload on refresh behavior may find this to be more noticeable.

#1763

Accessing an uninitialized collection attribute on a transient object no longer mutates __dict__

It has always been SQLAlchemy’s behavior that accessing mapped attributes on a newly created object returns an implicitly generated value, rather than raising AttributeError, such as None for scalar attributes or [] for a list-holding relationship:

>>> u1 = User()
>>> u1.name
None
>>> u1.addresses
[]

The rationale for the above behavior was originally to make ORM objects easier to work with. Since an ORM object represents an empty row when first created without any state, it is intuitive that its un-accessed attributes would resolve to None (or SQL NULL) for scalars and to empty collections for relationships. In particular, it makes possible an extremely common pattern of being able to mutate the new collection without manually creating and assigning an empty collection first:

>>> u1 = User()
>>> u1.addresses.append(Address())  # no need to assign u1.addresses = []

Up until version 1.0 of SQLAlchemy, the behavior of this initialization system for both scalar attributes as well as collections would be that the None or empty collection would be populated into the object’s state, e.g. __dict__. This meant that the following two operations were equivalent:

>>> u1 = User()
>>> u1.name = None  # explicit assignment

>>> u2 = User()
>>> u2.name  # implicit assignment just by accessing it
None

Where above, both u1 and u2 would have the value None populated for the name attribute. Since this is a SQL NULL, the ORM would skip including these values within an INSERT so that SQL-level defaults take place, if any, else the value defaults to NULL on the database side.

In version 1.0 as part of Changes to attribute events and other operations regarding attributes that have no pre-existing value, this behavior was refined so that the None value was no longer populated into __dict__, only returned. Besides removing the mutating side effect of a getter operation, this change also made it possible to set columns that did have server defaults to the value NULL by actually assigning None, which was now distinguished from just reading it.

The change however did not accommodate for collections, where returning an empty collection that is not assigned meant that this mutable collection would be different each time and also would not be able to correctly accommodate mutating operations (e.g. append, add, etc.) called upon it. While the behavior continued to generally not get in anyone’s way, an edge case was eventually identified in #4519 where this empty collection could be harmful, which is when the object is merged into a session:

>>> u1 = User(id=1)  # create an empty User to merge with id=1 in the database
>>> merged1 = session.merge(u1)  # value of merged1.addresses is unchanged from that of the DB

>>> u2 = User(id=2) # create an empty User to merge with id=2 in the database
>>> u2.addresses
[]
>>> merged2 = session.merge(u2)  # value of merged2.addresses has been emptied in the DB

Above, the .addresses collection on merged1 will contain all the Address() objects that were already in the database. merged2 will not; because it has an empty list implicitly assigned, the .addresses collection will be erased. This is an example of where this mutating side effect can actually mutate the database itself.

While it was considered that perhaps the attribute system should begin using strict “plain Python” behavior, raising AttributeError in all cases for non-existent attributes on non-persistent objects and requiring that all collections be explicitly assigned, such a change would likely be too extreme for the vast number of applications that have relied upon this behavior for many years. This would lead to a complex rollout / backwards compatibility problem, as well as the likelihood that workarounds to restore the old behavior would become prevalent, thus rendering the whole change ineffective in any case.

The change then is to keep the default producing behavior, but to finally make the non-mutating behavior of scalars a reality for collections as well, via the addition of additional mechanics in the collection system. When accessing the empty attribute, the new collection is created and associated with the state, however it is not added to __dict__ until it is actually mutated:

>>> u1 = User()
>>> l1 = u1.addresses  # new list is created, associated with the state
>>> assert u1.addresses is l1  # you get the same list each time you access it
>>> assert "addresses" not in u1.__dict__  # but it won't go into __dict__ until it's mutated
>>> from sqlalchemy import inspect
>>> inspect(u1).attrs.addresses.history
History(added=None, unchanged=None, deleted=None)

When the list is changed, then it becomes part of the tracked changes to be persisted to the database:

>>> l1.append(Address())
>>> assert "addresses" in u1.__dict__
>>> inspect(u1).attrs.addresses.history
History(added=[<__main__.Address object at 0x7f49b725eda0>], unchanged=[], deleted=[])

This change is expected to have nearly no impact on existing applications in any way, except that it has been observed that some applications may be relying upon the implicit assignment of this collection, such as to assert that the object contains certain values based on its __dict__:

>>> u1 = User()
>>> u1.addresses
[]
# this will now fail, would pass before
>>> assert {k: v for k, v in u1.__dict__.items() if not k.startswith("_")} == {"addresses": []}

or to ensure that the collection won’t require a lazy load to proceed, the (admittedly awkward) code below will now also fail:

>>> u1 = User()
>>> u1.addresses
[]
>>> s.add(u1)
>>> s.flush()
>>> s.close()
>>> u1.addresses  # <-- will fail, .addresses is not loaded and object is detached

Applications that rely upon the implicit mutating behavior of collections will need to be changed so that they assign the desired collection explicitly:

>>> u1.addresses = []

#4519

The “New instance conflicts with existing identity” error is now a warning

SQLAlchemy has always had logic to detect when an object to be inserted in the Session has the same primary key as an object that is already present:

class Product(Base):
    __tablename__ = 'product'

    id = Column(Integer, primary_key=True)

session = Session(engine)

# add Product with primary key 1
session.add(Product(id=1))
session.flush()

# add another Product with same primary key
session.add(Product(id=1))
session.commit()  # <-- will raise FlushError

The change is that the FlushError is altered to be only a warning:

sqlalchemy/orm/persistence.py:408: SAWarning: New instance <Product at 0x7f1ff65e0ba8> with identity key (<class '__main__.Product'>, (1,), None) conflicts with persistent instance <Product at 0x7f1ff60a4550>

Subsequent to the warning, the flush will proceed to attempt to insert the row into the database, which will emit IntegrityError; this is the same error that would be raised if the primary key identity were not already present in the Session:

sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: product.id

The rationale is to allow code that is using IntegrityError to catch duplicates to function regardless of the existing state of the Session, as is often done using savepoints:

# add another Product with same primary key
try:
    with session.begin_nested():
        session.add(Product(id=1))
except exc.IntegrityError:
    print("row already exists")

The above logic was not fully feasible earlier, as in the case that the Product object with the existing identity were already in the Session, the code would also have to catch FlushError, which additionally is not filtered for the specific condition of integrity issues. With the change, the above block behaves consistently with the exception of the warning also being emitted.

Since the logic in question deals with the primary key, all databases emit an integrity error in the case of primary key conflicts on INSERT. The case where an error would not be raised, but would have been earlier, is the extremely unusual scenario of a mapping that defines a primary key on the mapped selectable that is more restrictive than what is actually configured in the database schema, such as when mapping to joins of tables or when defining additional columns as part of a composite primary key that is not actually constrained in the database schema. However, these situations also work more consistently in that the INSERT would theoretically proceed whether or not the existing identity were still in the database. The warning can also be configured to raise an exception using the Python warnings filter.
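
For example, to restore a hard failure for this specific condition, the standard library warnings filter can match the message text; a sketch:

import warnings
from sqlalchemy import exc

# escalate just this warning back into an exception
warnings.filterwarnings(
    "error",
    message=r"New instance .* conflicts with persistent instance",
    category=exc.SAWarning,
)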

#4662

New Features - Core

Built-in FROM linting will warn for any potential cartesian products in a SELECT statement

As the Core expression language as well as the ORM are built on an “implicit FROMs” model where a particular FROM clause is automatically added if any part of the query refers to it, a common issue is the case where a SELECT statement, either a top level statement or an embedded subquery, contains FROM elements that are not joined to the rest of the FROM elements in the query, causing what’s referred to as a “cartesian product” in the result set, i.e. every possible combination of rows from each FROM element not otherwise joined. In relational databases, this is nearly always an undesirable outcome as it produces an enormous result set full of duplicated, uncorrelated data.

SQLAlchemy, for all of its great features, is particularly prone to this sort of issue happening as a SELECT statement will have elements added to its FROM clause automatically from any table seen in the other clauses. A typical scenario looks like the following, where two tables are JOINed together, however an additional entry in the WHERE clause that perhaps inadvertently does not line up with these two tables will create an additional FROM entry:

address_alias = aliased(Address)

q = session.query(User).\
    join(address_alias, User.addresses).\
    filter(Address.email_address == 'foo')

The above query selects from a JOIN of User and address_alias, the latter of which is an alias of the Address entity. However, the Address entity is used within the WHERE clause directly, so the above would result in the SQL:

SELECT
    users.id AS users_id, users.name AS users_name,
    users.fullname AS users_fullname,
    users.nickname AS users_nickname
FROM addresses, users JOIN addresses AS addresses_1 ON users.id = addresses_1.user_id
WHERE addresses.email_address = :email_address_1

In the above SQL, we can see what SQLAlchemy developers term “the dreaded comma”, as we see “FROM addresses, users JOIN addresses” in the FROM clause which is the classic sign of a cartesian product; where a query is making use of JOIN in order to join FROM clauses together, however because one of them is not joined, it uses a comma. The above query will return a full set of rows that join the “user” and “addresses” table together on the “id / user_id” column, and will then apply all those rows into a cartesian product against every row in the “addresses” table directly. That is, if there are ten user rows and 100 rows in addresses, the above query will return its expected result rows, likely to be 100 as all address rows would be selected, multiplied by 100 again, so that the total result size would be 10000 rows.

The “table1, table2 JOIN table3” pattern is one that also occurs quite frequently within the SQLAlchemy ORM, due either to subtle mis-application of ORM features, particularly those related to joined eager loading or joined table inheritance, or as a result of SQLAlchemy ORM bugs within those same systems. Similar issues apply to SELECT statements that use “implicit joins”, where the JOIN keyword is not used and instead each FROM element is linked with another one via the WHERE clause.

For some years there has been a recipe on the Wiki that applies a graph algorithm to a select() construct at query execution time and inspects the structure of the query for these un-linked FROM clauses, parsing through the WHERE clause and all JOIN clauses to determine how FROM elements are linked together and ensuring that all the FROM elements are connected in a single graph. This recipe has now been adapted to be part of the SQLCompiler itself where it now optionally emits a warning for a statement if this condition is detected. The warning is enabled using the create_engine.enable_from_linting flag and is enabled by default. The computational overhead of the linter is very low, and additionally it only occurs during statement compilation which means for a cached SQL statement it only occurs once.
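
As the flag is on by default, it would typically only be set explicitly in order to disable the linting; a sketch using a hypothetical database URL:

from sqlalchemy import create_engine

# linting occurs at statement compilation; pass False to silence the
# warning engine-wide
engine = create_engine(
    "postgresql://scott:tiger@localhost/test",
    enable_from_linting=False,
)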

Using this feature, our ORM query above will emit a warning:

>>> q.all()
SAWarning: SELECT statement has a cartesian product between FROM
element(s) "addresses_1", "users" and FROM element "addresses".
Apply join condition(s) between each element to resolve.

The linter feature accommodates not just tables linked together through JOIN clauses but also through the WHERE clause. Above, we can add a WHERE clause to link the new Address entity with the previous address_alias entity, and that will remove the warning:

q = session.query(User).\
    join(address_alias, User.addresses).\
    filter(Address.email_address == 'foo').\
    filter(Address.id == address_alias.id)  # resolve cartesian products,
                                            # will no longer warn

The cartesian product warning considers any kind of link between two FROM clauses to be a resolution, even if the end result set is still wasteful, as the linter is intended only to detect the common case of a FROM clause that is completely unexpected. If the FROM clause is referred to explicitly elsewhere and linked to the other FROMs, no warning is emitted:

q = session.query(User).\
    join(address_alias, User.addresses).\
    filter(Address.email_address == 'foo').\
    filter(Address.id > address_alias.id)  # will generate a lot of rows,
                                           # but no warning

Full cartesian products are also allowed if they are explicitly stated; if we want, for example, the cartesian product of User and Address, we can JOIN on true() so that every row will match with every other; the following query will return all rows and produce no warnings:

from sqlalchemy import true

# intentional cartesian product
q = session.query(User).join(Address, true())

The warning is only generated by default when the statement is compiled by the Connection for execution; calling the ClauseElement.compile() method will not emit a warning unless the linting flag is supplied:

>>> from sqlalchemy.sql import FROM_LINTING
>>> print(q.statement.compile(linting=FROM_LINTING))
SAWarning: SELECT statement has a cartesian product between FROM element(s) "addresses" and FROM element "users".  Apply join condition(s) between each element to resolve.
SELECT users.id, users.name, users.fullname, users.nickname
FROM addresses, users JOIN addresses AS addresses_1 ON users.id = addresses_1.user_id
WHERE addresses.email_address = :email_address_1

#4737

Behavioral Changes - Core

SELECT objects and derived FROM clauses allow for duplicate columns and column labels

The select() construct now allows for duplicate column labels as well as duplicate column objects themselves, so that result tuples are organized and ordered in the identical way in which the columns were selected. The ORM Query already works this way, so this change allows for greater cross-compatibility between the two, which is a key goal of the 2.0 transition:

>>> from sqlalchemy import column, select
>>> c1, c2, c3, c4 = column('c1'), column('c2'), column('c3'), column('c4')
>>> stmt = select([c1, c2, c3.label('c2'), c2, c4])
>>> print(stmt)
SELECT c1, c2, c3 AS c2, c2, c4

To support this change, the ColumnCollection used by SelectBase as well as for derived FROM clauses such as subqueries also support duplicate columns; this includes the new SelectBase.selected_columns attribute, the deprecated SelectBase.c attribute, as well as the FromClause.c attribute seen on constructs such as Subquery and Alias:

>>> list(stmt.selected_columns)
[
    <sqlalchemy.sql.elements.ColumnClause at 0x7fa540bcca20; c1>,
    <sqlalchemy.sql.elements.ColumnClause at 0x7fa540bcc9e8; c2>,
    <sqlalchemy.sql.elements.Label object at 0x7fa540b3e2e8>,
    <sqlalchemy.sql.elements.ColumnClause at 0x7fa540bcc9e8; c2>,
    <sqlalchemy.sql.elements.ColumnClause at 0x7fa540897048; c4>
]

>>> print(stmt.subquery().select())
SELECT anon_1.c1, anon_1.c2, anon_1.c2, anon_1.c2, anon_1.c4
FROM (SELECT c1, c2, c3 AS c2, c2, c4) AS anon_1

ColumnCollection also allows access by integer index, to support cases where the string “key” is ambiguous:

>>> stmt.selected_columns[2]
<sqlalchemy.sql.elements.Label object at 0x7fa540b3e2e8>

To suit the use of ColumnCollection in objects such as Table and PrimaryKeyConstraint, the old “deduplicating” behavior which is more critical for these objects is preserved in a new class DedupeColumnCollection.

As part of the change, the familiar warning "Column %r on table %r being replaced by %r, which has the same key.  Consider use_labels for select() statements." is removed; the Select.apply_labels() method is still available and is still used by the ORM for all SELECT operations. However, it does not imply deduplication of column objects, although it does imply deduplication of implicitly generated labels:

>>> from sqlalchemy import table
>>> user = table('user', column('id'), column('name'))
>>> stmt = select([user.c.id, user.c.name, user.c.id]).apply_labels()
>>> print(stmt)
SELECT "user".id AS user_id, "user".name AS user_name, "user".id AS id_1
FROM "user"

Finally, the change makes it easier to create UNION and other CompoundSelect objects, by ensuring that the number and position of columns in a SELECT statement mirrors what was given, in a use case such as:

>>> s1 = select([user, user.c.id])
>>> s2 = select([c1, c2, c3])
>>> from sqlalchemy import union
>>> u = union(s1, s2)
>>> print(u)
SELECT "user".id, "user".name, "user".id
FROM "user" UNION SELECT c1, c2, c3

#4753

The “RowProxy” is no longer a “proxy”, now called Row

Since the beginning of SQLAlchemy, the Core result objects exposed to the user are the ResultProxy and RowProxy objects. The name “proxy” refers to the GOF Proxy Pattern, emphasizing that these objects are presenting a facade around the DBAPI cursor object and the tuple-like objects returned by methods such as cursor.fetchone(); as methods on the result and row proxy objects are invoked, the underlying methods or data members of the cursor and the tuple-like objects returned are invoked.

In particular, SQLAlchemy’s row-processing functions would be invoked as a particular column in a row is accessed. By row-processing functions, we refer to functions such as that of the Unicode datatype, which under Python 2 would often convert Python string objects to Python unicode objects, as well as numeric functions that produce Decimal objects, SQLite datetime functions that produce datetime objects from string representations, as well as any-number of user-defined functions which can be created using TypeDecorator.

The rationale for this pattern was performance, where the anticipated use case of fetching a row from a legacy database that contained dozens of columns would not need to run, for example, a unicode converter on every element of each row, if only a few columns in the row were being fetched. SQLAlchemy eventually gained C extensions which allowed for additional performance gains within this process.

As part of SQLAlchemy 1.4’s goal of migrating towards SQLAlchemy 2.0’s updated usage patterns, row objects will be made to behave more like tuples. To suit this, the “proxy” behavior of Row has been removed, and instead the row is populated with its final data values upon construction. This in particular allows an operation such as obj in row to work as that of a tuple, where it tests for containment of obj in the row itself, rather than considering obj to be a key in a mapping, as is the case now. For the moment, obj in row still does a key lookup, that is, detects if the row has a particular column name as obj; however this behavior is deprecated, and in 2.0 the Row will behave fully as a tuple-like object, with lookup of keys moving to the ._mapping attribute.
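
A sketch of the two access styles in 1.4, assuming a connection and the user table used elsewhere in this document:

row = connection.execute(select([user])).first()

row[0]                   # positional, tuple-like access
row._mapping["name"]     # forwards-compatible key access via ._mapping
"name" in row._mapping   # key membership; a plain ("name" in row) key test
                         # still works in 1.4 but is deprecated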

The result of removing the proxy behavior from rows is that the C code has been simplified and the performance of many operations is improved both with and without the C extensions in use. Modern Python DBAPIs handle unicode conversion natively in most cases, and SQLAlchemy’s unicode handlers are very fast in any case, so the expense of unicode conversion is a non-issue.

This change by itself has no behavioral impact on the row, but is part of a larger series of changes in #4710 which unifies the Core row/result facade with that of the ORM.

#4710

Improved column labeling for simple column expressions using CAST or similar

A user pointed out that the PostgreSQL database has a convenient behavior when using functions like CAST against a named column, in that the result column name is named the same as the inner expression:

test=> SELECT CAST(data AS VARCHAR) FROM foo;
 data
------
 5
(1 row)

This allows one to apply CAST to table columns while not losing the column name (above using the name "data") in the result row. Compare to databases such as MySQL/MariaDB, as well as most others, where the column name is taken from the full SQL expression and is not very portable:

MariaDB [test]> SELECT CAST(data AS CHAR) FROM foo;
+--------------------+
| CAST(data AS CHAR) |
+--------------------+
| 5                  |
+--------------------+
1 row in set (0.003 sec)

In SQLAlchemy Core expressions, we never deal with a raw generated name like the above, as SQLAlchemy applies auto-labeling to expressions like these, which are up until now always a so-called “anonymous” expression:

>>> print(select([cast(foo.c.data, String)]))
SELECT CAST(foo.data AS VARCHAR) AS anon_1     # old behavior
FROM foo

These anonymous expressions were necessary, as SQLAlchemy’s ResultProxy made heavy use of result column names in order to match up datatypes, such as the String datatype which used to have result-row-processing behavior, to the correct column, so most importantly the names had to be both easy to determine in a database-agnostic manner as well as unique in all cases. In SQLAlchemy 1.0, as part of #918, this reliance on named columns in result rows (specifically the cursor.description element of the PEP-249 cursor) was scaled back so as to not be necessary for most Core SELECT constructs; in release 1.4, the system overall is becoming more comfortable with SELECT statements that have duplicate column or label names, as described in SELECT objects and derived FROM clauses allow for duplicate columns and column labels. So we now emulate PostgreSQL’s reasonable behavior for simple modifications to a single column, most prominently with CAST:

>>> print(select([cast(foo.c.data, String)]))
SELECT CAST(foo.data AS VARCHAR) AS data
FROM foo

For CAST against expressions that don’t have a name, the previous logic is used to generate the usual “anonymous” labels:

>>> print(select([cast('hi there,' + foo.c.data, String)]))
SELECT CAST(:data_1 + foo.data AS VARCHAR) AS anon_1
FROM foo

A cast() against a Label, despite having to omit the label expression as these don’t render inside of a CAST, will nonetheless make use of the given name:

>>> print(select([cast(('hi there,' + foo.c.data).label('hello_data'), String)]))
SELECT CAST(:data_1 + foo.data AS VARCHAR) AS hello_data
FROM foo

And of course as was always the case, Label can be applied to the expression on the outside to apply an “AS <name>” label directly:

>>> print(select([cast(('hi there,' + foo.c.data), String).label('hello_data')]))
SELECT CAST(:data_1 + foo.data AS VARCHAR) AS hello_data
FROM foo

#4449

New “post compile” bound parameters used for LIMIT/OFFSET in Oracle, SQL Server

A major goal of the 1.4 series is to establish that all Core SQL constructs are completely cacheable, meaning that a particular Compiled structure will produce an identical SQL string regardless of any SQL parameters used with it, which notably includes those used to specify the LIMIT and OFFSET values, typically used for pagination and “top N” style results.

While SQLAlchemy has used bound parameters for LIMIT/OFFSET schemes for many years, a few outliers remained where such parameters were not allowed, including a SQL Server “TOP N” statement, such as:

SELECT TOP 5 mytable.id, mytable.data FROM mytable

as well as with Oracle, where the FIRST_ROWS() hint (which SQLAlchemy will use if the optimize_limits=True parameter is passed to create_engine() with an Oracle URL) does not allow them; additionally, using bound parameters with ROWNUM comparisons has been reported as producing slower query plans:

SELECT anon_1.id, anon_1.data FROM (
    SELECT /*+ FIRST_ROWS(5) */
    anon_2.id AS id,
    anon_2.data AS data,
    ROWNUM AS ora_rn FROM (
        SELECT mytable.id, mytable.data FROM mytable
    ) anon_2
    WHERE ROWNUM <= :param_1
) anon_1 WHERE ora_rn > :param_2

In order to allow for all statements to be unconditionally cacheable at the compilation level, a new form of bound parameter called a “post compile” parameter has been added, which makes use of the same mechanism as that of “expanding IN parameters”. This is a bindparam() that behaves identically to any other bound parameter, except that the parameter value will be rendered literally into the SQL string before sending it to the DBAPI cursor.execute() method. The new parameter is used internally by the SQL Server and Oracle dialects, so that the drivers receive the literal rendered value but the rest of SQLAlchemy can still consider this as a bound parameter. The above two statements, when stringified using str(statement.compile(dialect=<dialect>)), now look like:

SELECT TOP [POSTCOMPILE_param_1] mytable.id, mytable.data FROM mytable

and:

SELECT anon_1.id, anon_1.data FROM (
    SELECT /*+ FIRST_ROWS([POSTCOMPILE__ora_frow_1]) */
    anon_2.id AS id,
    anon_2.data AS data,
    ROWNUM AS ora_rn FROM (
        SELECT mytable.id, mytable.data FROM mytable
    ) anon_2
    WHERE ROWNUM <= [POSTCOMPILE_param_1]
) anon_1 WHERE ora_rn > [POSTCOMPILE_param_2]

The [POSTCOMPILE_<param>] format is also what is seen when an “expanding IN” is used.

When viewing the SQL logging output, the final form of the statement will be seen:

SELECT anon_1.id, anon_1.data FROM (
    SELECT /*+ FIRST_ROWS(5) */
    anon_2.id AS id,
    anon_2.data AS data,
    ROWNUM AS ora_rn FROM (
        SELECT mytable.id AS id, mytable.data AS data FROM mytable
    ) anon_2
    WHERE ROWNUM <= 8
) anon_1 WHERE ora_rn > 3

The “post compile parameter” feature is exposed as public API through the bindparam.literal_execute parameter; however, it is currently not intended for general use. The literal values are rendered using the TypeEngine.literal_processor() of the underlying datatype, which in SQLAlchemy has extremely limited scope, supporting only integers and simple string values.
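
For completeness, a sketch of the public flag against a hypothetical mytable construct; as noted, the literal rendering supports only simple values such as integers and strings:

from sqlalchemy import bindparam, select

stmt = select([mytable.c.id]).where(
    mytable.c.id > bindparam("min_id", value=10, literal_execute=True)
)
# the value 10 is rendered into the SQL string at execution time,
# rather than being passed to cursor.execute() as a separate parameter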

#4808

Connection-level transactions can now be inactive based on subtransaction

A Connection now includes the behavior where a Transaction can be made inactive due to a rollback on an inner transaction; however, the Transaction will not clear until it is itself rolled back.

This is essentially a new error condition which will disallow statement executions to proceed on a Connection if an inner “sub” transaction has been rolled back. The behavior works very similarly to that of the ORM Session, where if an outer transaction has been begun, it needs to be rolled back to clear the invalid transaction; this behavior is described in “This Session’s transaction has been rolled back due to a previous exception during flush.” (or similar)

While the Connection has had a less strict behavioral pattern than the Session, this change was made as it helps to identify when a subtransaction has rolled back the DBAPI transaction while the external code, unaware of this, attempts to continue proceeding, which in fact runs operations on a new transaction. The “test harness” pattern described at Joining a Session into an External Transaction (such as for test suites) is the common place for this to occur.

The new behavior is described in the errors page at This connection is on an inactive transaction. Please rollback() fully before proceeding.

Dialect Changes

Removed “join rewriting” logic from SQLite dialect; updated imports

Dropped support for right-nested join rewriting to support old SQLite versions prior to 3.7.16, released in 2013. It is not expected that any modern Python versions rely upon this limitation.

The behavior was first introduced in 0.9 and was part of the larger change of allowing for right nested joins as described at Many JOIN and LEFT OUTER JOIN expressions will no longer be wrapped in (SELECT * FROM ..) AS ANON_1. However the SQLite workaround produced many regressions in the 2013-2014 period due to its complexity. In 2016, the dialect was modified so that the join rewriting logic would only occur for SQLite versions prior to 3.7.16, after bisection was used to identify where SQLite fixed its support for this construct, and no further issues were reported against the behavior (even though some bugs were found internally). It is now anticipated that there are little to no Python builds for Python 2.7 or 3.4 and above (the supported Python versions) which would include a SQLite version prior to 3.7.17, and the behavior is only necessary in more complex ORM joining scenarios. A warning is now emitted if the installed SQLite version is older than 3.7.16.
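
The SQLite library version linked into the Python interpreter can be checked directly:

import sqlite3

# version of the SQLite library itself (not the sqlite3 module);
# 3.7.16 or greater avoids the new warning
print(sqlite3.sqlite_version)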

In related changes, the module imports for SQLite no longer attempt to import the “pysqlite2” driver on Python 3 as this driver does not exist on Python 3; a very old warning for old pysqlite2 versions is also dropped.

#4895

Added Sequence support for MariaDB 10.3

The MariaDB database as of 10.3 supports sequences. SQLAlchemy’s MySQL dialect now implements support for the Sequence object against this database, meaning “CREATE SEQUENCE” DDL will be emitted for a Sequence that is present in a Table or MetaData collection, in the same way as it works for backends such as PostgreSQL and Oracle, when the dialect’s server version check has confirmed the database is MariaDB 10.3 or greater. Additionally, the Sequence will act as a column default and primary key generation object when used in these ways.

Since this change will impact the assumptions both for DDL as well as for the behavior of INSERT statements for an application that is currently deployed against MariaDB 10.3 and also happens to make explicit use of the Sequence construct within its table definitions, it is important to note that Sequence supports a flag Sequence.optional which is used to limit the scenarios in which the Sequence takes effect. When “optional” is used on a Sequence that is present in the integer primary key column of a table:

Table(
    "some_table", metadata,
    Column("id", Integer, Sequence("some_seq", optional=True), primary_key=True)
)

The above Sequence is only used for DDL and INSERT statements if the target database does not support any other means of generating integer primary key values for the column. That is, the Oracle database would use the sequence, whereas the PostgreSQL and MariaDB 10.3 databases would not. This may be important for an existing application upgrading to SQLAlchemy 1.4 which may not have emitted DDL for this Sequence against its backing database, as an INSERT statement will fail if it seeks to use a sequence that was not created.

#4976
