The role of the software architect is wonderful. As an architect, you face a number of challenging design issues. You have to find the balance between time, resources, cost and requirements. You have to make compromises, even if you hate them. And in the end, you will be blamed by the users :)
Now that I am designing the Reporting features of ORF 3.0, I find the role of the architect rather painful. If you have the strange passion to watch others making poor compromises, read on.
Due to the attention that we pay to performance, ORF is capable of filtering millions of emails every day. It is not just marketing: we are aware of a few installations where ORF filters 1-3 million emails per day per server. I do not know about you, but I find that impressive. And it causes design headaches as well.
The new Reporting features in ORF 3.0 will be log-based, that is, reports will be derived from log data. As a single filtered email results in 1 to 2 or even more log entries, there would be about 365,000,000 to 730,000,000 log entries in a yearly report for a large client filtering a million emails per day. In our logs (Verbose mode), the average log entry length is 190 bytes, so the size of the logs to be processed for a year will be around 69,350,000,000 to 138,700,000,000 bytes, that is, 64.6 GB to 129.2 GB.
That is an awful lot of data to be processed. The Short log mode reduces this a little, but not that much. The log parser of ORF is optimized for speed, so it has pretty high throughput: about 22.07 MB/sec on my developer workstation. This means that parsing 64.6 GB of logs will take about 2997 seconds, or 50 minutes; in the less lucky case, close to 2 hours. And that is just parsing; transforming the data for the reports will add significant delay. Alas, I do not really see how this could be made faster. Lots of information takes lots of time to process. Sorry guys, you will have to wait for your reports.
However, 90% of our clients fall in the 0-10,000 emails per day category, so most will have logs smaller by a factor of 100, and 3.65 million entries will parse in 30 seconds. Or, if the client has faster disks, maybe even less. Yes, the disk I/O can be a bottleneck.
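For those who want to redo the arithmetic with their own numbers, here is a rough back-of-the-envelope sketch in Python. It is not part of ORF; the function name `yearly_log_estimate` and its parameters are made up for illustration. It assumes the 190-byte average entry and the 22.07 MB/sec throughput quoted above, and it assumes binary units (1 MB = 1024 * 1024 bytes), which is what the 64.6 GB figure works out to.

```python
# Back-of-the-envelope estimate of yearly log size and parse time.
# Constants come from the figures in this post; everything else is illustrative.

BYTES_PER_ENTRY = 190           # average Verbose-mode log entry length
THROUGHPUT_MB_PER_SEC = 22.07   # parser throughput on one developer workstation
MB = 1024 ** 2                  # assumption: binary megabytes
GB = 1024 ** 3                  # assumption: binary gigabytes


def yearly_log_estimate(emails_per_day, entries_per_email, days=365):
    """Return (entries, size_in_bytes, parse_seconds) for one year of logs."""
    entries = emails_per_day * entries_per_email * days
    size_bytes = entries * BYTES_PER_ENTRY
    parse_seconds = size_bytes / (THROUGHPUT_MB_PER_SEC * MB)
    return entries, size_bytes, parse_seconds


if __name__ == "__main__":
    scenarios = [
        ("large client, 1 entry per email", 1_000_000, 1),
        ("large client, 2 entries per email", 1_000_000, 2),
        ("typical client, 1 entry per email", 10_000, 1),
    ]
    for label, emails_per_day, entries_per_email in scenarios:
        entries, size_bytes, seconds = yearly_log_estimate(
            emails_per_day, entries_per_email
        )
        print(
            f"{label}: {entries:,} entries, "
            f"{size_bytes / GB:.1f} GB, "
            f"about {seconds / 60:.1f} minutes to parse"
        )
```

The estimate covers parsing only and ignores disk speed, so treat the result as a lower bound on how long a report takes.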
This was speed. In my next post, I will talk about the significant concerns.