ORF Reporting: Design, Part 3: ORF PowerLogs

[Screenshot: PowerLogs raw CSV]

I gave you a brief introduction to ORF 3.0 PowerLogs in my previous article; now I would like to tell you more about the details.

PowerLogs aim to provide much more detailed information than the current ORF text logs do. Some of this new data is needed for reporting; the rest is there purely for the user’s benefit, such as the subject of the email. Let’s see what’s new:

  • Email subject. Yes, the logs will tell you the subject of the email :) This will help identify false positives and emails in general. We have to admit that logging the Message-ID did not really work out, but we could not log the subject previously due to the ANSI charset limitation.
    Of course, this will only work at the On Arrival filtering point, as there is no such thing as “email” at the Before Arrival filtering point.
  • Complete test log of every email. This is required by Reporting—in order to generate test effectiveness reports for individual tests, we must be able to tell how many emails the IP Blacklist or the Spamcop DNSBL checked and caught.
  • List references. Got an email blacklisted by the Keyword Filtering, but you have no idea which filter caused the blacklisting? No problem, the log will tell you which keyword filter it was: not just the keyword filter comment (if any), but the keyword filter expression itself. Similarly, it will tell you the related IP Whitelist, Sender Blacklist, DNSBL or SURBL expression. This also means that you will be able to generate reports about the effectiveness of individual DNSBLs (maybe not in 3.0, but in a later version, as the data is there). A sketch of such a lookup follows this list.
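
Here is a minimal sketch of how such a lookup could work. The file and column names are hypothetical, as the final PowerLog layout is not specified here; the idea is simply that a log row carries a reference ID and the separate reference file maps that ID to the full filter expression:

    import csv

    # Hypothetical columns: the log row holds a "filter_ref" ID; the
    # reference file maps that ID to the full keyword filter expression.
    def load_references(ref_path):
        """Build an ID -> filter expression map from the reference CSV."""
        with open(ref_path, encoding="utf-8", newline="") as f:
            return {row["id"]: row["expression"] for row in csv.DictReader(f)}

    def explain_blacklistings(log_path, ref_path):
        """Print which filter expression caught each blacklisted email."""
        refs = load_references(ref_path)
        with open(log_path, encoding="utf-8", newline="") as f:
            for row in csv.DictReader(f):
                if row.get("filter_ref"):
                    print(row["subject"], "->", refs.get(row["filter_ref"], "?"))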

Some more unsorted facts:

  • Logs and list references (see above) will be stored in separate files (n:1 relation, n log files and 1 reference file) to save disk space.
  • Both files will be in regular CSV format, with Unicode UTF-8 encoding. No more “Unicode comment cannot be logged” complaints :)
  • Timestamps will be in Coordinated Universal Time (UTC), to avoid Daylight Saving Time (Summer Time) problems.
  • Log file name format will be fixed and will always contain the date when the log was generated. ORF test logs offer flexibility in this regard, but unfortunately that flexibility adds lots of extra complexity to log processing. The date will be generated in UTC, which might be a bit strange for those whose time zone offset is more than 1-2 hours. (See the sketch after this list for how such files could be consumed.)
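
To make these points concrete, here is a small sketch of consuming such logs. The name pattern and column names are my own assumptions for illustration; the post only fixes the ingredients (date-stamped file names, UTF-8 CSV, UTC timestamps):

    import csv
    from datetime import datetime, timezone

    # Hypothetical name pattern: one log file per UTC day.
    def log_name_for(day):
        return day.strftime("orf-%Y-%m-%d.log.csv")

    # Read a UTF-8 CSV log, parsing timestamps as timezone-aware UTC.
    def read_log(path):
        with open(path, encoding="utf-8", newline="") as f:
            for row in csv.DictReader(f):
                row["when"] = datetime.fromisoformat(
                    row["timestamp"]).replace(tzinfo=timezone.utc)
                yield row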

Obviously, logging more data means larger log files. A lot larger. Because of this, real-time reporting with full log processing would have very poor performance. Our fine-tuned CSV reader has a processing speed of about 8Mb/s, which is nowhere near the performance needed to generate yearly reports reasonably fast (it would take hours on a high-load server). Also, those who run high-load servers will accumulate gigabytes of logs in a year, which they are unlikely to keep around, even for reporting.
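
A quick back-of-envelope calculation illustrates the problem. Both figures below are my own illustrative assumptions (reading the 8Mb/s as 8 megabytes of CSV per second, and positing roughly 100 GB of logs per year on a busy server):

    # Back-of-envelope only: both figures are illustrative assumptions.
    log_size_gb = 100                     # roughly a year of logs on a busy server
    seconds = log_size_gb * 1024 / 8      # at 8 MB/s of CSV processing
    print(f"{seconds / 3600:.1f} hours")  # -> 3.6 hours for one full-log yearly report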

To reduce the time needed, ORF will generate preprocessed report files daily (or more often). As ORF users have to be able to generate reports for a specific time range, e.g. for Jan 1, 2006-July 1, 2006, we cannot just build a single incremental report. Instead, full reports with a given resolution are needed: when the user requests a report for Q3 2006, ORF generates a summary from the preprocessed reports between those dates. Of course, preprocessed reports will also take disk space, so the resolution cannot be as fine as 1 second or 1 minute. As the range of reports and their exact contents are yet to be specified, the question of resolution is still open, but it may be 1 hour or 24 hours.
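
The aggregation step then boils down to summing many small files. A minimal sketch, assuming a hypothetical one-JSON-file-of-counters-per-UTC-day format (the post deliberately leaves the real format and resolution open):

    import json
    from datetime import date, timedelta

    # Hypothetical format: one JSON file of per-test counters per UTC day.
    def report_for_range(first, last):
        """Sum the daily preprocessed counters over [first, last]."""
        totals = {}
        day = first
        while day <= last:
            with open(day.strftime("report-%Y-%m-%d.json"), encoding="utf-8") as f:
                for test, count in json.load(f).items():
                    totals[test] = totals.get(test, 0) + count
            day += timedelta(days=1)
        return totals

    # E.g. a Q3 2006 report is just the sum of ~92 small daily files:
    # report_for_range(date(2006, 7, 1), date(2006, 9, 30))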

6 thoughts on “ORF Reporting: Design, Part 3: ORF PowerLogs”

  1. Alianz

    Seems like something I could definitely use. The logging of subject lines would definitely be helpful.

    I am not sure that you have fully firmed up the concept, though.

    It sounds like it would run like the Reports from Microsoft’s ISA, i.e. basic logging to one or more files, and then daily report generation.

    As an advanced user I would like to have the option of logging to a database. This could, however, be added using my own code, simply by reading the logs. The key is to log using a consistent CSV format, making it easy to load into a table.

  2. Peter

    I don’t really know ISA Server, but AWStats does something similar. The point of this “preprocessing design” is to divide the otherwise very expensive processing into smaller and faster steps, thus reducing the response time.

    With regard to direct database logging, we currently have no such plans, for practical reasons: PowerLogs would actually be much harder to fit into the SQL table concept.

    This is because of the PowerLog event diversity. We have 3 main log entities defined so far: Server Events, Automatic Sender Whitelist Additions and Mail-Related Events. Server Events offer only untyped/unspecified specializations (e.g. event subtype and custom data in custom format). This means that they are not hierarchical and you can easily store them in a table. Auto Sender Whitelist Additions are fully defined and final, so they are also OK for tables. Mail-Related Events are also fully defined and final, but they contain something called a “microlog”, which is the test log of the given email. Micrologs may contain 20+ entities which have very little in common. This means 20+ new tables. Also, microevents can have references to keyword filters, attachment filters, DNSBLs, etc., a total of 8 entity types.

    Of course, if you ever wanted to generate SQL-based reports from this, a single-table design would not work (not to mention type problems or column count limitations). In an SQL view of things, you would easily end up with 30 tables and 1:n:m:… relationships.

  3. Alianz

    OIC, but I don’t think it would be too hard to decode into a table structure, though not with standard utilities.

    A microlog is a record with a repeating element? That needs first normal form normalisation to get into two tables. Maybe the repeating elements have varying formats?

    No worries, it can still be normalised into a table structure as long as it sticks to some rules: a record type indicator and a well-defined structure per type. (A sketch follows below.)
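
    For illustration, a minimal sketch of that two-table shape (my own hypothetical columns, not ORF’s schema), with the type-specific fields packed into a single text column:

        import sqlite3

        # Hypothetical schema: one parent row per email, one child row per
        # microlog element, carrying a record type indicator.
        con = sqlite3.connect("powerlog.db")
        con.executescript("""
        CREATE TABLE IF NOT EXISTS mail_event (
            id       INTEGER PRIMARY KEY,
            received TEXT NOT NULL,     -- UTC timestamp
            subject  TEXT
        );
        CREATE TABLE IF NOT EXISTS microlog (
            mail_id   INTEGER NOT NULL REFERENCES mail_event(id),
            seq       INTEGER NOT NULL, -- order of the test within the email
            test_type TEXT NOT NULL,    -- the record type indicator
            detail    TEXT              -- type-specific fields, e.g. as JSON
        );
        """)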

    I was wondering why CSV and not tab-delimited, as some text fields like the subject could contain commas?

    Having data in SQL allows me to do things you didn’t provide for. E.g. the other day an old friend of mine sent me an email from Chevron (Caltex) in the USA. The mail went first to the primary, then to the secondary, and in both cases it was greylisted. The MTA gave up at this point.

    I only noticed it by accident, as normally there is just too much junk to wade through.

    I would want to log both the primary and the secondary to the same db with an extra column, then run an SQL query to look for similar situations.

  4. Peter

    Sorry for the late response.

    Sure, I am not saying it cannot be done, I am just saying it would be a PITA to use with SQL, thus it would have a small audience and a poor P/E ratio.

    Microevents are the test log of the email within a single PowerLog entry, and they differ a lot from each other; an IP whitelist test has totally different properties than a SURBL test. There are 20+ different entities, which would require 20+ tables just for the microevents, when properly normalized.

    As for the format, TAB-delimited vs. comma-delimited makes very little difference, because some of the logged data may still have TABs embedded, so the same quoting is required as with commas. The general CSV rule is to enclose the field in double quotes when the field separator is present inside the field.
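
    Python’s csv module, for one, applies exactly this rule by default:

        import csv, io

        # A field containing the delimiter is enclosed in double quotes
        # automatically (the default "minimal quoting" dialect).
        buf = io.StringIO()
        csv.writer(buf).writerow(["Re: hello, world", "OK"])
        print(buf.getvalue())  # -> "Re: hello, world",OK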

  5. Alianz

    Thanks.

    I may not normalise the microlog then, and simply store it in a single column.

    Another thing I would like to be able to drill into is the IP address behind the IP address when the mail arrives via an intermediate host. Would this be in the microlog?

  6. Pingback: Vamsoft Insider » …And… Action!
