Gadfly: SQL Relational Database in Python, 1.0
Gadfly is a simple relational database system implemented in Python based on the SQL Structured Query Language.
One of the most compelling aspects of Gadfly is that it runs whereever Python runs and supports client/server on any platform that supports the standard Python socket interface. Even the file formats used by Gadfly for storage are cross-platform -- a gadfly database directory can be moved from Win95 to Linux using a binary copying mechanism and gadfly will read and run the database.
It supports persistent databases consisting of a collection of structured tables with indices, and a large subset of SQL for accessing and modifying those tables. It supports a log based recovery protocol which allows committed operations of a database to be recovered even if the database was not shut down in a proper manner (ie, in the event of a CPU or software crash, [but not in the event of a disk crash]). It also supports a TCP/IP Client/Server mode where remote clients can access a Gadfly database over a TCP/IP network (such as the Internet) subject to configurable security mechanisms.
Because it lacks (at this time) true concurrency control, and file-system based indexing it is not appropriate for very large multiprocess transaction based systems.
Since Gadfly depends intimately on the kwParsing package it is distributed as part of the kwParsing package, under the same generous copyright.
Gadfly allows Python programs a convenient way to store, retrieve and query tabular data without having to rely on any external database engine or package. That is, Gadfly provides a simple, easy, and relatively efficient in-memory relational database style engine for Python programs, complete with a notion of a "committed, recoverable transaction" and "aborts".
Looking at the /etc directory in unix or at the Registry under win32 or at the buzzillions of configuration files one find sitting around file systems it becomes apparent that modern programs depend heavily on tabular data. Furthermore as memory prices continue to drop and inexpensive machines keep growing larger and larger memory capacity it is clear that more and more database-style work can be done on largish data sets in memory, and hence a simple in-memory SQL implementation like Gadfly may be useful for serious work.
Gadfly uses relational style representations and the SQL query language primarily because these are widely understood and familiar to many programmers. SQL can't do everything, but it is successful in part because it can do a lot of important things easily and well. (Python can do everything else...)
The main "gadfly.py" module attempts to faithfully adhere to Greg Stein's Python Database API, as discussed and certified by the Python DB-SIG. That said, some of the API that I didn't really understand is not implemented yet. Please look to gadfly.py to determine exactly what parts of the interface are implemented or stubbed.
Concurrent database updates are not supported. The "databases" are currently designed to be written/modified by one process in isolation. Multiple processes can access a Gadfly database when accesses are arbitrated by a tcp/ip Gadfly server process.
Unlike most Python/database-engine interfaces you must create a Gadfly database using Python (whereas with Oracle you'd use other tools, for example). To accomplish this use
import gadfly connection = gadfly.gadfly()with no arguments and then startup a database using the startup method.
connection.startup("mydatabase", "mydirectory")Here "mydirectory" must be a directory which exists and which can be written to in order to store the database files. The startup will create some files in "mydirectory". This will have the effect of clobbering any existing Gadfly database called "mydatabase" in the directory "mydirectory". Gadfly will prevent you from starting up the same connection twice, however.
Note that the first "import gadfly" reads in and initializes some rather large data structures used for parsing SQL, and thus may take longer than other module imports.
Now with your new database you can create tables, populate them, and commit the result when you are happy.
cursor = connection.cursor() cursor.execute("create table ph (nm varchar, ph varchar)") cursor.execute("insert into ph(nm, ph) values ('arw', '3367')") cursor.execute("select * from ph") for x in cursor.fetchall(): print x # prints ('arw', '3367') connection.commit()
import gadfly connection = gadfly.gadfly("mydatabase", "mydirectory")This will read in the database tables with the most recently committed values. The initialized database may now be queried and updated.
cursor = connection.cursor() cursor.execute("update ph set nm='aaron' where nm='arw'") cursor.execute("select * from ph") for x in cursor.fetchall(): print x # prints ('aaron', '3367')If you do not wish to commit updates you may simply not execute a commit on the connection object (which writes out the tables). If you wish to restore the old values from the existing database use
connection.abort()Updates are only stored upon a connection.commit(). [Actually, if autocheckpoint is disabled, updates are only stored to table files on checkpoint -- see the documentation on the recovery mechanism.]
print cursor.pp()to "pretty print" the result of any evaluation (which might be None for a non-select). the SQL constructs page for a more detailed presentation. SQL statements supported include the SELECT:
SELECT [DISTINCT|ALL] expressions or * FROM tables [WHERE condition] [GROUP BY group-expressions] [HAVING aggregate-condition] [union-clause] [ORDER BY columns]This statement is quite powerful. It reads intuitively as follows:
1) Make all combinations of rows from the tables (FROM line) 2) Eliminate those combinations not satisfying condition (WHERE line) 3) (if GROUP present) form aggregate groups that match on group-expressions 4) (if HAVING present) eliminate aggregate groups that don't satisfy the aggregate-condition. 5) compute the columns to keep (SELECT line). 6) (if union-clause present) combine (union, difference, intersect) the result with the result of another select statement. 7) if DISTINCT, throw out redundant entries. 8) (if ORDER present) order the result by the columns (ascending or descending as specified, with precedence as listed).The actual implementation in gadfly is much more optimal than the intuitive reading, particularly at steps 1 and 2 (which are combined via optimizing transformations and hash join algorithms).
Conditions may include equalities, and inequalities of expressions. Conditions may also be combined using AND, OR, NOT. Expressions include column names, constants, and standard arithmetic operations over them.
Embedded queries supported include subquery expressions, expr IN (subselect), quantified comparisons, and the EXISTS (subselect) predicate.
Aggregate tests and computations can only be applied after the GROUPing and before the columns are selected (steps 3,4,5). Aggregate operations include COUNT(*), COUNT(expression), AVG(expression), SUM(expression), MAX(expression), MIN(expression), and the non-standard MEDIAN(expression). These may be applied to DISTINCT values (throwing out redundancies, as in COUNT(DISTINCT drinker). if no GROUPing is present the aggregate computations apply to the entire result after step 2.
There is much more to know about the SELECT statement. The test suite (gftest.py) gives numerous examples of SELECT statements, including:
select * from frequents where drinker = 'norm' select drinker from likes union select drinker from frequents select drinker from likes except select drinker from frequents select * from frequents where drinker>'norm' or drinker<'b' select * from frequents as f, serves as s where f.bar = s.bar select * from frequents as f, serves as s where f.bar = s.bar and not exists( select l.drinker, l.beer from likes l where l.drinker=f.drinker and s.beer=l.beer) select sum(quantity), avg(quantity), count(*), sum(quantity)/count(quantity) from serves select bar, sum(quantity), avg(quantity), count(*), sum(quantity)/count(quantity) from serves where beer<>'bud' group by bar having sum(quantity)>500 or count(*)>3 order by 2 desc select l.drinker, l.beer, count(*), sum(l.perday*f.perweek) from likes l, frequents f where l.drinker=f.drinker group by l.drinker, l.beer order by 4 desc, l.drinker, l.beerPlease examine sqlgram.py for a precise definition of the supported syntax. Please find any of the 500 books on SQL for a description of the meaning of these constructs. Please inform me if any of them give the wrong result when executed in Gadfly!
CREATE TABLE name (colname datatype [, colname datatype...])Data types currently "supported" are integer, float, and varchar. They are ignored by the implementation, anything that is hashable and marshallable can currently go in any column (but that is likely to change). For example
create table frequents (drinker varchar, bar varchar, perweek integer)At present you can put tuples, complexes, or anything else into a column specified as "varchar". Don't count on that always being true, please.
DELETE FROM table WHERE condition UPDATE table SET col=expr [, col=expr...] WHERE condition INSERT INTO table [(column [, column...])] values (value [, value...]) INSERT INTO table [(column [, column...])] subselect CREATE [UNIQUE] INDEX name ON table (column [, column...]) DROP TABLE table DROP INDEX nameFor example
delete from templikes where be='rollingrock' update templikes set dr='norman' where dr='norm' insert into ph(nm,ph) values ('nan', '0356') insert into templikes(dr, be) select drinker, beer from likes create index sbb on serves (beer, bar) drop table templikes drop index tdindexMultiple statements may be executed in one cursor.execute(S) by separating the statements with semicolons in S, for example S might have the string value
drop index tdindex; drop table templikes(no final semicolon please!).
Please see gftest.py for examples of most of these. Remember that SQL is case insensitive (capitalization of keywords doesn't matter). Please see sqlgram.py for a precise definition of all supported constructs
insertstat = "insert into ph(nm,ph) values (?, ?)" cursor.execute(insertstat, ('nan', "0356")) cursor.execute(insertstat, ('bill', "2356")) cursor.execute(insertstat, ('tom', "4356"))Dynamic values allow the cursor to use the same parsed expression many times for a similar operation. Above the insertstat is parsed and bound to the database only once. Using dynamic attributes should speed up accesses. Thus the above should run much faster than the equivalent
cursor.execute("insert into ph(nm,ph) values ('nan', '0356')"); cursor.execute("insert into ph(nm,ph) values ('bill', '2356')"); cursor.execute("insert into ph(nm,ph) values ('tom', '4356')");Dynamic attributes can appear in other statements containing expressions (such as SELECTs, UPDATEs and DELETEs too).
For SELECT, UPDATE, and DELETE the dynamic expression substitutions must consist of a single tuple, as in
stat = "select * from ph where nm=?" cursor.execute(stat, ("nan",)) ... cursor.execute(stat, ("bob",)) ...Since the dynamic substitution eliminates the need for parsing and binding (expensive operations!) the above should run faster than the equivalent
cursor.execute("select * from ph where nm='nan'") ... cursor.execute("select * from ph where nm='bob'") ...If you repeat several similar queries multiple times, associate each query "template string" with a unique cursor object so that each template must be parsed and bound only once. Note that some relatively complex queries from the test suite run 2 to 3 times faster after they have been parsed and bound, even without the kjbuckets builtin. With kjbuckets the same ran 5 to 10 times faster.
data = [('nan', "0356")), ('bill', "2356"), ('tom', "4356")] stat = "insert into ph(nm,ph) values (?, ?)" cursor.execute(stat, data)...and it would be even faster if the cursor had previously executed the stat with different data (since then no parsing or binding would occur).
>>> g = gadfly() >>> g.startup("dbtest", "dbtest") >>> c = g.cursor() >>> c.execute("select * from __table_names__") >>> print c.pp() IS_VIEW | TABLE_NAME ========================= 1 | __TABLE_NAMES__ 1 | DUAL 1 | __DATADEFS__ 1 | __COLUMNS__ 1 | __INDICES__ 1 | __INDEXCOLS__Here DUAL is a standard one row/one column test table (from the Oracle tradition) and the other tables provide information about the database schema.
>>> c.execute("create table t1 (a varchar, b varchar)") >>> c.execute("create table t2 (b varchar, c varchar)") >>> c.execute("create unique index t1a on t1(a)") >>> c.execute("create index t1b on t1(b)") >>> c.execute("select * from __table_names__") >>> print c.pp() IS_VIEW | TABLE_NAME ========================= 0 | T1 1 | __DATADEFS__ 1 | __INDICES__ 0 | T2 1 | __TABLE_NAMES__ 1 | __COLUMNS__ 1 | DUAL 1 | __INDEXCOLS__ >>> c.execute("select * from __columns__") >>> print c.pp() COLUMN_NAME | TABLE_NAME ============================= A | T1 B | T1 NAME | __DATADEFS__ DEFN | __DATADEFS__ INDEX_NAME | __INDICES__ TABLE_NAME | __INDICES__ IS_UNIQUE | __INDICES__ TABLE_NAME | __TABLE_NAMES__ IS_VIEW | __TABLE_NAMES__ B | T2 C | T2 COLUMN1 | DUAL TABLE_NAME | __COLUMNS__ COLUMN_NAME | __COLUMNS__ INDEX_NAME | __INDEXCOLS__ COLUMN_NAME | __INDEXCOLS__ >>> c.execute("select * from __indices__") >>> print c.pp() IS_UNIQUE | TABLE_NAME | INDEX_NAME =================================== 0 | T1 | T1B 1 | T1 | T1A >>> c.execute("select * from __indexcols__") >>> print c.pp() COLUMN_NAME | INDEX_NAME ======================== B | T1B A | T1A >>> c.execute("select * from dual") >>> print c.pp() COLUMN1 ======= 0
InstallationTo guarantee correct installation, please follow the following procedure. Of course you must have Python in order to use this package!
Unpack the package into a directory on the Python search path.
Install: In the installation directory run the command
% python gfinstall.pyThis creates sqlwhere.py, which aides in the location of the parser data file required for parsing SQL. In some circumstances this may create the data file also. If it does the parser generation process may take several minutes, but you only need to run it during installation.
If you think there is a problem Use
% python gfinstall.py forceto force regeneration of the parser structures (for example if you modify the grammar or suspect the grammar data file sql.mar is corrupt).
Test/compile: Now (possibly in a different directory) run
% mkdir dbtest % python gftest.py dbtestThis will create a test database and exercise the system to make sure it works. It also should have the side effect of byte compiling all *.py files required by gadfly. Run gftest.py as a user with write permission to the installation directory in order to guarantee byte compilation. It should just work ;c).
Python 1.4 (Feb 4 1997) [MSC 32 bit (Intel)] Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam >>> from gadfly import gadfly >>> connection = gadfly("test", "dbtest") >>> cursor = connection.cursor() >>> cursor.execute("select * from frequents") >>> cursor.description (('DRINKER', None, None, None, None, None, None), ('PERWEEK', None, None, None, None, None, None), ('BAR', None, None, None, None, None, None)) >>> print cursor.pp() DRINKER | PERWEEK | BAR ============================ adam | 1 | lolas woody | 5 | cheers sam | 5 | cheers norm | 3 | cheers wilt | 2 | joes norm | 1 | joes lola | 6 | lolas norm | 2 | lolas woody | 1 | lolas pierre | 0 | frankies >>>
A. Watters, "Interpreting a Reconstructed Relational Calculus", ACM SIGMOD Proceedings, 1993, Washington DC, pp. 367-376.The most basic data structures of the implementation are given in either kjbuckets0.py or the faster kjbucketsmodule.c, which implement the same data type signatures in Python and in a C extension to Python respectively.
The gadfly.py module is a simple wrapper that provides a standard DBAPI interface to the system. The installation script gfinstall.py attempts to install the system, creating the grammar file sql.mar if needed (or if "forced"). The test suite gftest.py (which requires a writeable directory argument) attempts to provide a regression test and a demonstration of the system. The SQL parser also requires the kwParsing parser generation package, which consists of a number of additional python modules.me.
The query engine should run faster if you have the builtin module kjbuckets installed. Otherwise it will use a "python imitation" kjbuckets0.py. In one test the test suite ran two times faster using kjbuckets. I suspect it will have a higher payoff for larger data sets.