firstworks Introduction to Database Access from Web-Based Applications

Introduction to Database Access from Web-Based Applications

It's rare these days that a substantial web-based application is written that doesn't need to access some sort of database. This discussion examines (in reasonably high level detail) the different tools available for developing web based applications and the database access options a developer has with each.

The definition of "Database"

First, the question "What do you mean when you say database?" needs to be answered. That may seem like a no-brainer, but I've argued long and hard with people about the definition of the word "database" in the past. For the purposes of this discussion, a database is an organized store of information that an application can access. That's a very general definition, and there is probably a more specific and perhaps more accurate definition somewhere on the net, but that's what "database" means in this discussion.

Different Types of Databases

There are quite a few different types of databases. Heirarchical and relational databases are probably the only ones relevant to a developer today.

Heirarchical databases are constructed like a tree (actually an acyclic (or infrequently a cyclic) graph) of nodes. Each node must be referenced by the path to that node from some other node. Individual nodes can contain a variety of information, but the tree structure is what defines the database type. Examples of heirarchical databases are LDAP databases and filesystems. Yes, a filesystem is a actually a database; a heirarchical one.

Relational databases are collections of tables. Each table consists of rows and columns. The intersection of a row and column is called a field. Columns have a data type such as character, numeric or boolean and usually a length. What makes this whole thing relational is that a field in one table can correspond to a field in another table, relating the rows in the two tables.

Relational Databases

There are actually different types of relational databases.

Structured Query Language (or SQL) databases are probably the most popular. In an SQL database, clients access the data store by submitting queries to the RDBMS (Relational Database Management System). The RDBMS collects the data requested in the query and returns it to the client in a result set. Examples of SQL databases are the open source MySQL and mSQL databases and commercial products like Oracle, Sybase and Microsoft SQL Server databases.

Another kind of relational database is the xBase or BerkeleyDB style database. The word xBase is derived from the dBase I,II,III,IV and FoxBase products. In this kind of database, tables are stored as individual files and methods are provided for searching through and accessing rows and columns in the files directly. No high level query language is provided.

Between xBase and SQL databases lie databases like Microsoft Access and FoxPro. They both provide both direct table access methods and a high level query language, though not SQL proper.

Flat file databases are another kind of relational database. In flat file databases, the data store is kept in some kind of human readable format. Flat file databases can be opened and edited by a text editor. They are most often used to store small amounts of data. Though file formats vary, columns are often quoted and seperated by commas or left unquoted and seperated by tabs. Alternatively, columns can be fixed length and white space padded. Rows are usually seperated by carriage returns. These kinds of databases are often proprietary to a particular software package and are accessible by using low level file manipulation commands or through an API provided by the package.

Transactional Databases

In the RDBMS world, there are 2 varieties: transactional and non-transactional. A transactional RDBMS maintains discrete sessions and provides commit and rollback methods. Changes made by one session are invisible to other sessions until a commit is executed. Changes can also be rolled back before they are committed. During a rollback, the state of the database prior to the changes is restored. Most expensive commercial RDBMS's are transactional while lower end commercial and free databases are not.

Accessing Databases

"How can I get data from my database onto the web?"

That is the burning question. There are many answers.

Significant factors in determining a solution are the choices of operating system, database and development tools. Being constrained by one constrains the others. Another significant factor is the way the database information needs to be displayed.

If the application merely displays information from an infrequently updated database, then generating web-presentable reports may be an effective solition. Commercial reporting tools like Crystal Reports and Oracle Reports can print straight to HTML. These tools only run on Microsoft Windows though there may be a similar tools for other platforms. At a lower level, most databases have scriptable tools for executing queries. Presentations can be generated periodically with one of these tools and processsed for the web with some template html files and crafty sed scripting. For databases with no scriptable query tools, programs can be written in a variety of languages to accomplish the same end result using API calls.

For building dynamic, database-driven applications, the simplest solution is to use a web application server. Most have built in connectivity with some set of relational databases and come with a web page builder where database objects can by simply dragged and dropped onto web pages. Web application servers are usually constrained in OS support, functionality, extensibility, scalability and in the databases that they can connect to.

Microsoft Visual InterDev is more flexible. It's not a web application server, but a web-based application development system. In InterDev, relational database objects can be dragged and dropped into web pages, but the pages are Active Server Pages (ASP's) and the database objects are (ultimately) accessed from them through ODBC (Open Database Connectivity). ASP's are very extensible and ODBC can connect to a lot of different relational databases, but in the end the application will most likely have to run through Microsoft Internet Information System (IIS) on a Microsoft Operating System.

Oracle provides a unique solution with Oracle 8i. In Oracle 8i the RDBMS attaches to an external web server like Apache, Netscape or Microsoft IIS and serves applications from within. Applications reside and execute inside the database as PL/SQL procedures.

Most non-drag-and-drop solutions require some programming and an API. Popular programming languages for writing web-based applications include PHP, Perl, Python, C/C++ and Java.

PHP provides methods for accessing most databases including ODBC, LDAP, BerkeleyDB, xBase and a variety of SQL databases. It also provides filesystem access methods.

Perl provides DBI modules which can access almost any SQL database that exists. The DBI modules provide perl interfaces to database API calls. LDAP and filesystem access modules exist too.

Python has modules similar to Perl.

Database API's are almost always C libraries and may be used directly by any C or C++ program or serverlet. Alternatively, the ODBC API may be used in place of a specific database API on many platforms. xBase databases can be accessed using Sequiter CodeBASE or a number of freeware libraries. The BerkeleyDB database libraries come with Unix. Various freeware LDAP API's are available as C libraries.

For connecting to Oracle databases from C or C++ programs, Pro C is another option. Pro C is a preprocessor that translates embedded SQL into Oracle API calls.

Java provides JDBC (Java Database Connectivity) which, like ODBC, can connect to a lot of different relational databases. Additionally, some commercial database vendors provide a Java API for their databases along with the C API.

Accessing Databases Without an API

Sometimes it's necessary to write a web-based application using a language for which no database API's exist. Most languages can either execute DLL (shared object libraries on Unix) or COM (Common Object Model) object methods or make command line calls. Shared object libraries can be written in C and command line programs can be written in some language listed above. Though some extra work (perhaps a lot of extra work) is involved, there is almost always a way to access a database from a language without an API.

An even more limiting issue is the platform issue. Database API's only exist for a limited set of platforms. Some databases only have API's for Microsoft operating systems on Intel-based hardware. Others support both Microsoft and Unix, but only some Unixes. What about MacOS? What about PowerPC Linux? What about Microsoft NT on Alpha? If the database API doesn't exist for a particular platform then the language-specific interface to it doesn't exist for that platform either. A program which makes non-existant interface calls won't run or compile.

There are solutions to the "no-API-on-my-platform" challenge.

If you need to access Microsoft SQL Server from Unix. SQL Server and Sybase both use the TDS (Tabular Data Stream) protocol, so if you're using ODBC on Unix the Sybase driver may work. Unfortunately, SQL Server 7 uses a slightly modified protocol. Fortunately, the FreeTDS project provides API's which work with all releases of Sybase and Microsoft SQL Server.

Distributed processing is a generic solution. RPC (Remote Procedure Call), Corba, DCOM (Distributed COM) and Java RMI (Remote Method Invocation) allow a program to invoke methods on other computers and get the results back locally. These are useful if you have another computer somewhere to farm out the database work to. C and C++ support RPC and Corba on almost any platform and DCOM on some platforms (including some Unixes, but it's really expensive). Java supports RMI. Perl and Python support some of these methods outright, and may be extended with modules to support more.

SQL Relay provides another solution for accessing many different databases from software on unsupported platforms. The SQL Relay API can be compiled for platforms on which the server doesn't run. The server can be run on a different machine where it communicates over the network with the client on one hand and the database on the other. SQL Relay provides a service similar to the distributed processing solutions above, only through a highly specialized interface.

Another solution is to write command line programs on a remote machine (for which the database API of interest exists) which execute database API calls and return the results to standard output or to a file on a shared volume. These command line programs can then be executed remtotely using rcmd or rexec. The rcmd and rexec client/server system is available on most Unix systems out of the box and as freeware for other platforms.

Performance Issues and Non-Issues

Some RDBMS's, especially transactional SQL RDBMS's have time consuming login procedures, presumably because they were designed with an older client-server paradigm in mind. In that paradigm, a client logs into an RDBMS when a user starts it up in the morning then accesses the database through the same connection over and over, all day long. The client doesn't logout until the user shuts it down at the end of the day and doesn't login again until the next day. Since clients login infrequently and at non-mission critical times, the amount of time associated with that particular activity is irrelevant. Another reason may be that RDBMS's place a lower priority on accepting new connections than on handling currently exectuing queries. Or it could just be that logins are really expensive in transactional RDBMS's because of the whole session/commit/rollback thing. At any rate, web-based applications may need to log into an RDBMS every time a new page is loaded and the time those operations take can make an application intolerably slow.

Some databases were designed a long time ago and have no RDBMS overhead, transactional or otherwise. These include filesystems, BerkeleyDB and xBase databases. Since the programmer is responsible for accessing the database using low level commands, database operations can be highly optimized. The most signifigant performance issue encountered with these kinds of databases results from programmers using poorly optimized access algorithms because they are easier to implement than more highly optimized algorithms.

LDAP databases are designed for very high performance access and lower performance updates. Everything about accessing an LDAP database is optimized for speed, including logging in.

Performance Solutions

Recognizing the performance issues associated with accessing RDBMS's from web-based applications, developers have devised some solutions.

One solution is to tune the database to make logins faster. This sounds like a good solution at first, but inevitably has performance consequences impacting other database activities.

Another solution is to attach the web-server directly to the RDBMS. Oracle 8i does just that. It has limitations though. Platform support is extensive but may exclude your particular platform, PL/SQL is no speed demon, and Oracle 8i might be too expensive for some pursuits.

The third and most popular solution is to maintain pools of RDBMS connections and loan them out to needy executables. Web application servers do this by logging in when the application server starts up and allowing application threads to access the database through the open connection. When run as an Apache module, PHP has the same capability. Persistent CGI's written using the FastCGI library can be written to do the same thing. SQL Relay provides a solution along these lines for accessing many different databases. There are other examples, but those come to mind immediately. This solution suffers with regard to dynamic scalability. As the demand on an application increases, more immediately available database connections are needed. Otherwise, the application stalls while new connections start up or old ones finish their current job. Application load must be predictable ahead of time for this solution to be flawlessly effective.

Next...

Hopefully, this discussion has shed some light on an unfamiliar topic for some and struck a familiar chord with others. For more information about SQL Relay, press your browser's back button and proceed to the next discussion in the list: Introduction to SQL Relay.