ORF has two major issues with the Greylisting and the Auto Sender Whitelist databases:
- Sharing: databases cannot be shared between ORF instances
- Reliability: the database engine is not rock solid on some systems
Sharing is a simple case. Database sharing would be needed on networks with more than one inbound email server: it is no fun when an email gets temporarily rejected twice by Greylisting, just because the left hand doesn’t know what the right hand is doing.
Reliability is a bigger issue. Only a few systems are affected by the engine reliability issues (namely, database corruption), we found no common traits among the affected systems, and the issue has never reproduced in our lab. Under unrealistically high direct engine load, we could get some database operations to fail, but we could never cause anything as serious as database corruption. Still, the fact that we could drive the database into failures indicates that something is wrong with the thread safety of the database engine.
In ORF 4.0, we will address both of these issues.
To fix the reliability problems, ORF 4.0 will limit all database operations to a single thread. The engine has proven robust in single-threaded use and although this limitation reduces the throughput, the database performance will be OK for most smaller systems. Typically, ORF database operations complete well within 1 second (even on a large database), so 60 * 60 * 24 = 86,400 operations can be performed a day, enough for approx. 50,000 emails/day. This is just a rough estimation, but I think the actual figures will be better, and as for the question of scaling, the answer will be the same as for the other major issue.
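As a back-of-the-envelope check: if each email needs roughly 1.5 to 2 database operations (say, a greylisting lookup plus an auto sender whitelist check; this per-email figure is an assumption made for the sake of the estimate, not a measured value), then

$$
\frac{86{,}400\ \text{operations/day}}{\approx 1.7\ \text{operations/email}} \approx 50{,}000\ \text{emails/day}
$$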
Database sharing will be implemented using external databases, initially with support for the following servers: Microsoft SQL Server (MSSQL) 2005 Express Edition, MSSQL 2000 and MSSQL 2005. The benefits of this model are numerous:
- allows sharing the databases between ORF instances,
- provides high performance,
- can scale well (SQL Server Express -> SQL Server),
- allows programmatic access to Auto Sender Whitelist data.
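Regarding the last point, a report or monitoring script could read the Auto Sender Whitelist straight from the database. A minimal sketch, assuming a hypothetical table layout (the actual ORF 4.0 schema may well differ):

```sql
-- Illustrative only: the table and column names are made up,
-- the actual ORF 4.0 schema may differ.
SELECT TOP 20 SenderAddress, LastSeenUTC
FROM dbo.AutoSenderWhitelist
ORDER BY LastSeenUTC DESC;
```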
The price of choosing SQL servers is that the databases have to be created manually. We will provide downloadable guides with setup instructions and the necessary SQL statements, but running database servers requires some field-specific knowledge (e.g. adding users, configuring network access, etc.) that guides alone cannot fully substitute for.
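To give a rough idea of what such a guide involves, here is a minimal sketch of the kind of statements needed (the database, login and user names are only examples and the syntax targets MSSQL 2005; MSSQL 2000 uses sp_addlogin and sp_grantdbaccess instead):

```sql
-- Example only: the names and the db_owner role assignment are illustrative,
-- not the official ORF setup instructions.
CREATE DATABASE ORF;
GO
CREATE LOGIN orf_service WITH PASSWORD = 'use-a-strong-password-here';
GO
USE ORF;
GO
CREATE USER orf_service FOR LOGIN orf_service;
EXEC sp_addrolemember 'db_owner', 'orf_service';
GO
```

After this, the ORF instances would be pointed at the new database, and the SQL Server would have to be reachable from them over the network (TCP/IP enabled, firewall opened).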
I welcome this as a very useful feature. I often experience the double greylisting problem.
I guess that if I move the greylist to an SQL server, it has to reside somewhere. I could put it on a single high-reliability server, but I can’t help wondering what happens when that server goes down?
Ideally there should be redundant SQL servers with some form of replication between them? I could actually set this up myself if it’s not in the release, and could write the how-to.
The nice thing about Microsoft SQL Server is the scalability. You can get the free SQL Server 2005 Express version and if you outgrow it, you can always upgrade to the commercial SQL Server version. If database availability is a major concern, you can set up a failover cluster (see http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/failclus.mspx).
Last time I costed an SQL Server fail-over cluster, I couldn’t afford one. I have a lot of boxes, but most are old. When I buy a new box it’s not at the server (e.g. Xeon) level; mostly I build them myself. Clustering requires SCSI controllers and a RAID cabinet ($$$).
Something else: having worked in places where organisations rely on these things, I have often seen exactly the kind of failure I always suspected could happen: the RAID cabinet fails, or two disks in the cabinet fail at the same time.
Another problem with SQL Server clustering is the amount of time it takes to fail over. This might not be so important for email delivery, but I will tell you about it anyway while we are on the subject.
Back in 1999, when SQL Server 2000 was in beta, Microsoft were talking about their next release, code-named “Yukon”. They stated that it would be a fully load-balanced and fault-tolerant system, etc. Well, Yukon was delayed and eventually became SQL Server 2005, and the promised features were not included.
Meanwhile, Oracle released something like this, where the server could fail over almost instantly. The trouble with this release was that it involved a concept known as the shared lock manager. When two servers want the same disk block and one has it in memory, the other has to ping the first to get it to write the block out to disk. This means the cluster has lower performance than a single server.
Oracle have since worked through the issues and evolved Grid Computing, and I suspect the Microsoft marketing people realised that developing something similar would be very expensive and it would not be wise to compete with Oracle head-on.
Back in 2002 I was working with a system for controlling and monitoring the Rail Network. The system, which they had written themselves over 12 years, had many layers. In every layer, each server had to be fully backed up by a partner system on another computer. The partners were hot standbys, talking to each other to ensure things were in sync.
One small part of this system was a web farm that displayed maps of the network showing the location of the trains. This used a pair of SQL Servers for data.
The systems used to write to both SQL Servers at once. They looked at clustering, but the time delay involved in the fail-over was not acceptable. They also looked at Oracle, but for mostly political and budgetary reasons that idea never saw the light of day.
I was faced with issues in this arrangement. The servers throughout were generating auto-numbers that were not the same as those of their partners, and I had the problem of cleaning up the redundant data into a single copy. To do this, I wrote an improvement to the Time Server system that comes with NT, which keeps the clocks in sync, but only to within 15 seconds. My system got this to within 60 ms, so that time-stamps plus location could identify redundant data.
Back to email: my ideal setup is to have two big x64 boxes acting as Domain Controllers, SQL Servers and Exchange Servers. In front of these two are the servers running ISA and ORFEE.
To fail over ORFEE to the other SQL Server, I might write something that changes the config and restarts the service. I might use SQL Server replication to keep each other’s greylists etc. up to date, but as you know, SQL Server replication is heavyweight and has issues.
So instead, I might simply write something based on SQL Server 2005 notifications to replicate the data over.
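For example, even something as crude as a trigger pushing new rows to the partner over a linked server might do the job. This is only a sketch with made-up table, column and server names, and it is a plain trigger-based copy rather than the 2005 notification mechanism; being synchronous, it needs MSDTC and inserts will stall or fail if the partner is down:

```sql
-- Sketch only: table, column and linked server names are made up.
-- A synchronous trigger like this needs a linked server and MSDTC,
-- and local inserts fail when the partner is unreachable.
CREATE TRIGGER trgCopyGreylist ON dbo.Greylist
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO [PARTNER-SQL].ORF.dbo.Greylist (SenderIP, Sender, Recipient, FirstSeenUTC)
    SELECT SenderIP, Sender, Recipient, FirstSeenUTC
    FROM inserted;
END;
GO
```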
Peter, please clarify the database direction: will the use of an external SQL database be optional, allowing installations that don’t require multiple ORF instances to continue using the current internal database?
Eric,
My reading of Peter’s statement is that they will be retaining the current embedded SQL engine but limiting it to single-threaded operation. The external SQL Server will be an optional alternative for those who need it.
Eric,
Using external SQL databases will be optional and ORF will keep providing the internal/embedded database option. However, the performance of the embedded database will be cut back by the single-thread limit.
The estimated number of emails, that’s per server, correct? Larger organizations typically have at least two servers, and in theory you could add more as necessary, especially if you’re receiving “that many” emails.
ORF is one of the better products out there; unfortunately, I’ve learned that the hard way. I don’t want ORF to be discounted in any scenario based on the statement “the database performance will be OK for most smaller systems”. :-)
Eric: Yes, the estimation is per server. But if you plan to share the email traffic between servers, you should go with External Databases anyway, because only these can share data between servers. Do not worry, it will not be painful; my coworker Krisztian made an excellent guide for the (free) Microsoft SQL Server 2005 version.