Introduction to Database Access from Web-Based Applications
It's rare these days that a substantial web-based application is written
that doesn't need to access some sort of database. This discussion examines (in
reasonably high level detail) the different tools available for developing web
based applications and the database access options a developer has with each.
The definition of "Database"
First, the question "What do you mean when you say database?" needs to be
answered. That may seem like a no-brainer, but I've argued long and hard with
people about the definition of the word "database" in the past. For the
purposes of this discussion, a database is an organized store of
information that an application can access. That's a very general definition,
and there is probably a more specific and perhaps more accurate definition
somewhere on the net, but that's what "database" means in this discussion.
Different Types of Databases
There are quite a few different types of databases. Heirarchical and
relational databases are probably the only ones relevant to a developer today.
Heirarchical databases are constructed like a tree (actually an acyclic
(or infrequently a cyclic) graph) of nodes. Each node must be referenced by
the path to that node from some other node. Individual nodes can contain a variety of information, but the tree structure is what defines the database
type. Examples of heirarchical databases are LDAP databases and filesystems.
Yes, a filesystem is a actually a database; a heirarchical one.
Relational databases are collections of tables. Each table consists of
rows and columns. The intersection of a row and column is called a field.
Columns have a data type such as character, numeric or boolean and usually
a length. What makes this whole thing relational is that a field in one
table can correspond to a field in another table, relating the rows in the two
tables.
Relational Databases
There are actually different types of relational databases.
Structured Query Language (or SQL) databases are probably the most popular.
In an SQL database, clients access the data store by submitting queries to the
RDBMS (Relational Database Management System). The RDBMS collects the data
requested in the query and returns it to the client in a result set.
Examples of SQL databases are the open source MySQL and mSQL databases and
commercial products like Oracle, Sybase and Microsoft SQL Server databases.
Another kind of relational database is the xBase or BerkeleyDB style
database. The word xBase is derived from the dBase I,II,III,IV and FoxBase
products. In this kind of database, tables are stored as individual files
and methods are provided for searching through and accessing rows and
columns in the files directly. No high level query language is provided.
Between xBase and SQL databases lie databases like Microsoft Access and
FoxPro. They both provide both direct table access methods and a high level
query language, though not SQL proper.
Flat file databases are another kind of relational database. In flat
file databases, the data store is kept in some kind of human readable format.
Flat file databases can be opened and edited by a text editor. They are most
often used to store small amounts of data. Though file formats vary, columns
are often quoted and seperated by commas or left unquoted and seperated by
tabs. Alternatively, columns can be fixed length and white space padded.
Rows are usually seperated by carriage returns. These kinds of databases are
often proprietary to a particular software package and are accessible by
using low level file manipulation commands or through an API provided by the
package.
Transactional Databases
In the RDBMS world, there are 2 varieties: transactional and
non-transactional. A transactional RDBMS maintains discrete sessions
and provides commit and rollback methods. Changes made by one session are
invisible to other sessions until a commit is executed. Changes can
also be rolled back before they are committed. During a rollback, the state
of the database prior to the changes is restored. Most expensive commercial
RDBMS's are transactional while lower end commercial and free databases
are not.
Accessing Databases
"How can I get data from my database onto the web?"
That is the burning question. There are many answers.
Significant factors in determining a solution are the choices of operating
system, database and development tools. Being constrained by one constrains
the others. Another significant factor is the way the database information
needs to be displayed.
If the application merely displays information from an
infrequently updated database, then generating web-presentable reports may
be an effective solition. Commercial reporting tools like Crystal Reports
and Oracle Reports can print straight to HTML. These tools only run on
Microsoft Windows though there may be a similar tools for other platforms.
At a lower level, most databases have scriptable tools for executing queries.
Presentations can be generated periodically with one of these tools and
processsed for the web with some template html files and crafty sed
scripting. For databases with no scriptable query tools, programs can be
written in a variety of languages to accomplish the same end result using API
calls.
For building dynamic, database-driven applications, the simplest
solution is to use a web application server. Most have built in connectivity
with some set of relational databases and come with a web page
builder where database objects can by simply dragged and dropped onto
web pages. Web application servers are usually constrained in OS support,
functionality, extensibility, scalability and in the databases that they can
connect to.
Microsoft Visual InterDev is more flexible. It's not a web
application server, but a web-based application development system. In
InterDev, relational database objects can be dragged and dropped into web pages,
but the pages are Active Server Pages (ASP's) and the database objects are
(ultimately) accessed from them through ODBC (Open Database Connectivity).
ASP's are very extensible and ODBC can connect to a lot of different
relational databases, but in the end the application will most likely have to
run through Microsoft Internet Information System (IIS) on a Microsoft
Operating System.
Oracle provides a unique solution with Oracle 8i. In Oracle 8i the
RDBMS attaches to an external web server like Apache, Netscape or Microsoft IIS
and serves applications from within. Applications reside and execute inside
the database as PL/SQL procedures.
Most non-drag-and-drop solutions require some programming and an API.
Popular programming languages for writing web-based applications include PHP,
Perl, Python, C/C++ and Java.
PHP provides methods for accessing most databases including ODBC, LDAP,
BerkeleyDB, xBase and a variety of SQL databases. It also provides filesystem
access methods.
Perl provides DBI modules which can access almost any SQL database that
exists. The DBI modules provide perl interfaces to database API calls.
LDAP and filesystem access modules exist too.
Python has modules similar to Perl.
Database API's are almost always C libraries and may be used directly by
any C or C++ program or serverlet. Alternatively, the ODBC API may be used in
place of a specific database API on many platforms. xBase databases can be
accessed using Sequiter CodeBASE or a number of freeware libraries. The
BerkeleyDB database libraries come with Unix. Various freeware LDAP API's are
available as C libraries.
For connecting to Oracle databases from C or C++ programs, Pro C is another
option. Pro C is a preprocessor that translates embedded SQL into Oracle API
calls.
Java provides JDBC (Java Database Connectivity) which, like ODBC, can
connect to a lot of different relational databases. Additionally, some
commercial database vendors provide a Java API for their databases along with
the C API.
Accessing Databases Without an API
Sometimes it's necessary to write a web-based application using a language
for which no database API's exist. Most languages can either execute
DLL (shared object libraries on Unix) or COM (Common Object Model) object
methods or make command line calls. Shared object libraries can be written in
C and command line programs can be written in some language listed above.
Though some extra work (perhaps a lot of extra work) is involved, there is
almost always a way to access a database from a language without an API.
An even more limiting issue is the platform issue. Database API's only exist
for a limited set of platforms. Some databases only have API's for Microsoft
operating systems on Intel-based hardware. Others support both Microsoft and
Unix, but only some Unixes. What about MacOS? What about PowerPC Linux?
What about Microsoft NT on Alpha? If the database API doesn't exist for a
particular platform then the language-specific interface to it doesn't exist
for that platform either. A program which makes non-existant interface
calls won't run or compile.
There are solutions to the "no-API-on-my-platform" challenge.
If you need to access Microsoft SQL Server from Unix. SQL Server and Sybase
both use the TDS (Tabular Data Stream) protocol, so if you're using ODBC on
Unix the Sybase driver may work. Unfortunately, SQL Server 7 uses a slightly
modified protocol. Fortunately, the FreeTDS project provides API's which work
with all releases of Sybase and Microsoft SQL Server.
Distributed processing is a generic solution. RPC (Remote Procedure Call),
Corba, DCOM (Distributed COM) and Java RMI (Remote Method Invocation)
allow a program to invoke methods on other computers and get the results back
locally. These are useful if you have another computer somewhere to farm out
the database work to. C and C++ support RPC and Corba on almost any platform
and DCOM on some platforms (including some Unixes, but it's really expensive).
Java supports RMI. Perl and Python support some of these methods outright, and
may be extended with modules to support more.
SQL Relay provides another solution for accessing many different
databases from software on unsupported platforms. The SQL Relay API can
be compiled for platforms on which the server doesn't run. The server can be
run on a different machine where it communicates over the network with the
client on one hand and the database on the other. SQL Relay provides a
service similar to the distributed processing solutions above, only through a
highly specialized interface.
Another solution is to write command line programs on a remote machine (for
which the database API of interest exists) which execute database API calls
and return the results to standard output or to a file on a shared volume.
These command line programs can then be executed remtotely using rcmd or
rexec. The rcmd and rexec client/server system is available on most Unix
systems out of the box and as freeware for other platforms.
Performance Issues and Non-Issues
Some RDBMS's, especially transactional SQL RDBMS's have time consuming
login procedures, presumably because they were designed
with an older client-server paradigm in mind. In that paradigm, a client
logs into an RDBMS when a user starts it up in the morning then accesses
the database through the same connection over and over, all day long.
The client doesn't logout until the user shuts it down at the end of the day
and doesn't login again until the next day. Since clients login infrequently
and at non-mission critical times, the amount of time associated with that
particular activity is irrelevant. Another reason may be that RDBMS's place a
lower priority on accepting new connections than on handling currently
exectuing queries. Or it could just be that logins are really expensive in
transactional RDBMS's because of the whole session/commit/rollback thing. At
any rate, web-based applications may need to log into an RDBMS every time a
new page is loaded and the time those operations take can make an application
intolerably slow.
Some databases were designed a long time ago and have no RDBMS overhead,
transactional or otherwise. These include filesystems, BerkeleyDB and xBase
databases. Since the programmer is responsible for accessing the database
using low level commands, database operations can be highly optimized. The
most signifigant performance issue encountered with these kinds of databases
results from programmers using poorly optimized access algorithms because they
are easier to implement than more highly optimized algorithms.
LDAP databases are designed for very high performance access and lower
performance updates. Everything about accessing an LDAP database is optimized
for speed, including logging in.
Performance Solutions
Recognizing the performance issues associated with accessing RDBMS's from
web-based applications, developers have devised some solutions.
One solution is to tune the database to make logins faster. This
sounds like a good solution at first, but inevitably has performance
consequences impacting other database activities.
Another solution is to attach the web-server directly to the RDBMS. Oracle
8i does just that. It has limitations though. Platform support is
extensive but may exclude your particular platform, PL/SQL is no speed demon,
and Oracle 8i might be too expensive for some pursuits.
The third and most popular solution is to maintain pools of RDBMS
connections and loan them out to needy executables. Web application servers
do this by logging in when the application server starts up and allowing
application threads to access the database through the open connection.
When run as an Apache module, PHP has the same capability. Persistent CGI's
written using the FastCGI library can be written to do the same thing.
SQL Relay provides a solution along these lines for accessing
many different databases. There are other examples, but those come to mind
immediately. This solution suffers with regard to dynamic scalability. As the
demand on an application increases, more immediately available database
connections are needed. Otherwise, the application stalls while new connections
start up or old ones finish their current job. Application load must be
predictable ahead of time for this solution to be flawlessly effective.
Next...
Hopefully, this discussion has shed some light on an unfamiliar topic for
some and struck a familiar chord with others. For more information about
SQL Relay, press your browser's back button and proceed to the next
discussion in the list: Introduction to SQL Relay.