The Webalizer - A web server log file analysis tool
Copyright 1997-1999 by Bradford L. Barrett (brad@mrunix.net)
Distributed under the GNU GPL.
What is The Webalizer?
The Webalizer is a web server
log file analysis program which produces usage statistics in HTML format for
viewing with a browser. The results are presented in both columnar and graphical
format, which facilitates interpretation. Yearly, monthly, daily and hourly
usage statistics are presented, along with the ability to display usage by site,
URL, referrer, user agent (browser) and country (user agent and referrer are
only available if your web server produces Combined log format files).
The Webalizer supports CLF
(common log format) log files, as well as Combined log formats as defined by
NCSA and others, and variations of these which it attempts to handle intelligently.
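As a rough illustration of the two formats, the following Python sketch parses
one log line with a regular expression. The pattern and the name parse_line are
assumptions made for this example, not The Webalizer's own parser, and the sketch
does not try to handle the format variations mentioned above.

```python
import re

# Common Log Format fields, optionally followed by the two extra quoted
# fields (referrer and user agent) that make up the Combined format.
# This pattern is an illustrative assumption, not The Webalizer's parser.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?\s*$'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return match.groupdict()


clf_line = ('192.0.2.1 - - [10/Oct/1999:13:55:36 -0700] '
            '"GET /index.html HTTP/1.0" 200 2326')
combined_line = clf_line + ' "http://www.example.com/start.html" "Mozilla/4.08"'
```

For a plain CLF line the referrer and agent entries come back as None, which
mirrors the note above that those reports need Combined format logs.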
This documentation applies
to The Webalizer Version 1.30
Output Produced
The Webalizer produces several
reports (html) and graphics for each month processed. In addition, a summary
page is generated for the current and previous months (up to 12), a history
file is created and, if incremental mode is used, the current month's processed
data is saved. The exact location and names of these files can be changed using
configuration files and command line options. The files produced (default names) are:
- index.html - Main summary page (extension may be changed)
- usage.gif - Yearly graph displayed on the main index page
- usage_YYYYMM.html - Monthly summary page (extension may be changed)
- usage_YYYYMM.gif - Monthly usage graph for specified month/year
- daily_usage_YYYYMM.gif - Daily usage graph for specified month/year
- hourly_usage_YYYYMM.gif - Hourly usage graph for specified month/year
- webalizer.hist - Previous month history (may be changed)
- webalizer.current - Incremental Data (may be changed)
The yearly (index) report
shows statistics for a 12 month period, and links to each month. The monthly
report has detailed statistics for that month with additional links to any URLs
and referrers found. The various totals shown are explained below.
Hits
Any request made to the
server that is logged is considered a 'hit'. The requests can be for anything:
HTML pages, graphic images, audio files, CGI scripts, etc. Each valid line
in the server log is counted as a hit. This number represents the total number
of requests that were made to the server during the specified report period.
Files
Some requests made to the
server require that the server then send something back to the requesting client,
such as an HTML page or graphic image. When this happens, it is considered a
'file' and the files total is incremented. The relationship between 'hits' and
'files' can be thought of as 'incoming requests' and 'outgoing responses'.
Pages
Pages are, well, pages!
Generally, any HTML document, or anything that generates an HTML document, would
be considered a page. This does not include the other stuff that goes into a
document, such as graphic images, audio clips, etc... This number represents
the number of 'pages' requested only, and does not include the other 'stuff'
that is in the page. What actually constitutes a 'page' can vary from server
to server. The default action is to treat anything with the extension '.htm',
'.html' or '.cgi' as a page. A lot of sites will probably define other extensions,
such as '.phtml', '.php3' and '.pl' as pages as well. Some people consider this
number as the number of 'pure' hits... I'm not sure if I totally agree with that
viewpoint. Some other programs (and people :) refer to this as 'Pageviews'.
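The three totals described so far can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the input is a list of
(url, status) pairs, one per valid log line, and a response with status 200 is
taken to mean that the server sent something back. The names count_totals and
is_page are made up for the example, and the sketch does not treat directory
requests or failed requests specially.

```python
import posixpath

# Default page extensions named in the text above.
PAGE_EXTENSIONS = {".htm", ".html", ".cgi"}


def is_page(url):
    """True if the URL path (ignoring any query string) has a page extension."""
    path = url.split("?", 1)[0]
    extension = posixpath.splitext(path)[1]
    return extension in PAGE_EXTENSIONS


def count_totals(requests):
    """Count hits, files and pages for an iterable of (url, status) pairs."""
    hits = files = pages = 0
    for url, status in requests:
        hits += 1                 # every valid log line is a hit
        if status == 200:         # assumption: 200 means data was sent back
            files += 1
        if is_page(url):
            pages += 1
    return hits, files, pages


sample = [
    ("/index.html", 200),
    ("/logo.gif", 200),
    ("/logo.gif", 304),           # a request that sent no body back
    ("/cgi-bin/form.cgi?x=1", 200),
]
```

With the sample data, all four lines count as hits, three of them as files, and
the two requests for '.html' and '.cgi' URLs count as pages.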
Sites
Each request made to the
server comes from a unique 'site', which can be referenced by a name or ultimately,
an IP address. The 'sites' number shows how many unique IP addresses made requests
to the server during the reporting time period. This DOES NOT mean the number
of unique individual users (real people) that visited, which is impossible to
determine using just logs and the HTTP protocol (however, this number might
be about as close as you will get).
Visits
Whenever a request is made
to the server from a given IP address (site), the amount of time since the previous
request from that address, if any, is calculated. If the time difference is greater
than a preconfigured 'visit timeout' value, or if the address has never made a
request before, it is considered a 'new visit', and this total is incremented (both
for the site, and the IP address). The default timeout value is 30 minutes (can be
changed), so if a user visits your site at 1:00 in the afternoon, and then returns at
3:00, two visits would be registered. Note: in the 'Top Sites' table, the visits
total should be discounted on 'Grouped' records, and thought of as the "minimum
number of visits" that came from that grouping instead. Note: Visits only occur
on PageType requests, that is, for any request whose URL is one of the 'page'
types defined with the PageType option. Due to the limitations of the HTTP protocol,
log rotations and other factors, this number should not be taken as absolutely
accurate; rather, it should be considered a pretty close "guess".
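The timeout rule can be sketched as follows. This is a minimal Python
illustration, assuming the input is a time-ordered list of (ip, time) pairs for
page-type requests only; the name count_visits is hypothetical and the code is
not taken from The Webalizer itself.

```python
from datetime import datetime, timedelta

VISIT_TIMEOUT = timedelta(minutes=30)   # default value given in the text


def count_visits(page_requests, timeout=VISIT_TIMEOUT):
    """Count visits in a time-ordered iterable of (ip, datetime) pairs."""
    last_seen = {}   # ip -> time of that address's most recent page request
    visits = 0
    for ip, when in page_requests:
        previous = last_seen.get(ip)
        # A new visit starts if the address is new or has been idle too long.
        if previous is None or when - previous > timeout:
            visits += 1
        last_seen[ip] = when
    return visits


# The example from the text: one user at 1:00 pm who returns at 3:00 pm.
afternoon = [
    ("192.0.2.7", datetime(1999, 10, 10, 13, 0)),
    ("192.0.2.7", datetime(1999, 10, 10, 15, 0)),
]
```

Calling count_visits(afternoon) registers two visits, while two requests ten
minutes apart from the same address would register only one.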
KBytes
The KBytes (kilobytes) value
shows the amount of data, in KB, that was sent out by the server during the
specified reporting period. This value is generated directly from the log file,
so it is up to the webserver to produce accurate numbers in the logs (some web
servers do stupid things when it comes to reporting the number of bytes). In
general, this should be a fairly accurate representation of the amount of outgoing
traffic the server had, regardless of the web server's reporting quirks.
Note: A kilobyte is 1024
bytes, not 1000 :)
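The arithmetic is simple enough to show directly. The Python sketch below
assumes the size field of each log line is available as a string, with "-"
standing for a response that carried no bytes; the name total_kbytes is made up
for the example.

```python
def total_kbytes(sizes):
    """Sum the logged byte counts and convert to kilobytes (1 KB = 1024 bytes)."""
    total_bytes = sum(int(size) for size in sizes if size != "-")
    return total_bytes / 1024


logged_sizes = ["2048", "-", "1024"]   # size fields from three log lines
```

Here total_kbytes(logged_sizes) gives 3.0, since 3072 bytes divided by 1024 is
exactly 3 KB.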
Top Entry and Exit Pages
The Top Entry and Exit Pages
give a rough estimate of what URLs are used to enter your site, and what the
last pages viewed are. Because of limitations in the HTTP protocol, log rotations,
etc... these numbers should be considered a good "rough guess" of the actual numbers,
but they will give a good indication of the overall trend in where users come
into, and exit, your site.
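One plausible way to derive these figures from visits is sketched below in
Python. It assumes a time-ordered list of (ip, time, url) tuples for page
requests and a 30 minute visit timeout; the function name entry_exit_pages is
hypothetical, and The Webalizer's own bookkeeping may differ in detail.

```python
from collections import Counter
from datetime import datetime, timedelta


def entry_exit_pages(page_requests, timeout=timedelta(minutes=30)):
    """Return (entries, exits) Counters for time-ordered (ip, time, url) tuples."""
    last = {}                       # ip -> (time, url) of latest page request
    entries, exits = Counter(), Counter()
    for ip, when, url in page_requests:
        previous = last.get(ip)
        if previous is None or when - previous[0] > timeout:
            entries[url] += 1       # first page of a new visit
            if previous is not None:
                exits[previous[1]] += 1   # the earlier visit ended there
        last[ip] = (when, url)
    # Visits still open when the log ends exit on their last page seen.
    for _, url in last.values():
        exits[url] += 1
    return entries, exits


day = [
    ("192.0.2.7", datetime(1999, 10, 10, 13, 0), "/index.html"),
    ("192.0.2.7", datetime(1999, 10, 10, 13, 5), "/products.html"),
    ("192.0.2.7", datetime(1999, 10, 10, 15, 0), "/contact.html"),
]
```

For the sample data the entry pages are /index.html and /contact.html, and the
exit pages are /products.html and /contact.html, one for each of the two visits.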
Notes on Referrers
Referrers are weird critters...
They take many shapes and forms, which makes them much harder to analyze than
a typical URL, which at least has some standardization. What is contained in
the referrer field of your log files varies depending on many factors, such
as what site did the referral, what type of system it comes from and how the
actual referral was generated. Why is this? Well, because a user can get to your
site in many ways... They may have your site bookmarked in their browser, they
may simply type your site's URL into their browser, they could have clicked
on a link on some remote web page or they may have found your site from one
of the many search engines and site indexes found on the web. The Webalizer
attempts to deal with all this variation in an intelligent way by doing certain
things to the referrer string which make it easier to analyze. Of course, if
your web server doesn't provide referrer information, you probably don't really
care and are asking yourself why you are reading this section...
Most referrers will take
the form of "http://somesite.com/somepage.html", which is what you will get
if the user clicks on a link somewhere on the web in order to get to your site.
Some will be a variation of this, and look something like "file:/some/such/sillyname",
which is a reference from an HTML document on the user's local machine. Several
variations of this can be used, depending on what type of system the user has,
if he/she is on a local network, the type of network, etc... To complicate things
even more, dynamic HTML documents and HTML documents that are generated by cgi
scripts or external programs produce lots of extra information which is tacked
on to the end of the referrer string in an almost infinite number of ways. If
the user just typed your URL into their browser or clicked on a bookmark, there
won't be any information in the referrer field, which will take the form "-".
In order to handle all these
variations, The Webalizer parses the referrer field in a certain way. First,
if the referrer string begins with "http", it assumes it is a normal referral
and converts the "http://" and following hostname to lowercase in order to simplify
hiding if desired. For example, the referrer "HTTP://WWW.MyHost.Com/This/HTML/Document.html"
will become "http://www.myhost.com/This/HTML/Document.html". Notice that only the "http://"
and hostname are converted to lower case... The rest of the referrer field is
left alone. This follows standard convention, as the actual method (HTTP) and
hostname are always case insensitive, while the document name portion is case
sensitive.
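The case conversion can be sketched like this in Python. The function name
normalize_referrer_case is made up for the example, and checking for the "http"
prefix without regard to case is an assumption of this sketch.

```python
def normalize_referrer_case(referrer):
    """Lowercase the scheme and hostname of an http referrer, nothing else."""
    # Assumption: the "http" prefix test ignores case.
    if not referrer.lower().startswith("http"):
        return referrer
    scheme, separator, rest = referrer.partition("://")
    if not separator:
        return referrer             # no "://", so leave the string alone
    host, slash, path = rest.partition("/")
    # Only the scheme and hostname change; the document path keeps its case.
    return scheme.lower() + separator + host.lower() + slash + path


mixed = "HTTP://WWW.MyHost.Com/This/HTML/Document.html"
```

Here normalize_referrer_case(mixed) yields
"http://www.myhost.com/This/HTML/Document.html", and a referrer such as
"file:/some/such/sillyname" passes through unchanged.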
Referrers that came from
search engines, dynamic HTML documents, cgi scripts and other external programs
usually tack on additional information that was used to create the page. A common
example of this can be found in referrals that come from search engines and
site indexes common on the web. Sometimes, these referrer URLs can be several
hundred characters long and include all the information that the user typed
in to search for your site. The Webalizer deals with this type of referrer by
stripping off all the query information, which starts with a question mark '?'.
The referrer "http://search.yahoo.com/search?p=usa%26global%26link" will be
converted to just "http://search.yahoo.com/search".
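In Python terms, stripping the query amounts to cutting the string at the first
question mark. The name strip_query is hypothetical; this is a sketch of the
idea rather than The Webalizer's code.

```python
def strip_query(referrer):
    """Drop everything from the first '?' onward, if there is one."""
    return referrer.split("?", 1)[0]


yahoo = "http://search.yahoo.com/search?p=usa%26global%26link"
```

strip_query(yahoo) gives "http://search.yahoo.com/search", and a referrer with
no question mark is returned as it is.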
When a user comes to your
site by using one of their bookmarks or by typing your URL directly into
their browser, the referrer field is blank, and looks like "-". Most sites will
get more of these referrals than any other type. The Webalizer converts this
type of referral into the string "- (Direct Request)". This is done in order
to make it easier to hide via a command line option or configuration file option.
The character "-" is a valid character elsewhere in a referrer
field, so if it were not turned into something unique, it could not be hidden without
possibly hiding other referrers that shouldn't be.
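The reason for the unique string is easier to see with a small Python sketch.
It assumes, purely for illustration, that hiding works by substring matching;
the helper names label_direct and is_hidden are made up and do not correspond to
actual Webalizer options.

```python
DIRECT_REQUEST = "- (Direct Request)"   # replacement string given in the text


def label_direct(referrer):
    """Turn a blank referrer ("-") into the unique direct-request string."""
    return DIRECT_REQUEST if referrer == "-" else referrer


def is_hidden(referrer, hide_patterns):
    """Assumed hiding rule: hide if any pattern occurs in the referrer."""
    return any(pattern in referrer for pattern in hide_patterns)


other = "http://my-site.example/page.html"   # a normal referrer containing "-"
```

Hiding on the bare "-" would also hide the unrelated referrer in other, whereas
hiding on "- (Direct Request)" matches only the converted blank referrers.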
Notes on Character Escaping
The HTTP protocol defines
certain ways that URLs can look and behave. To some extent, referrer fields
follow most of the same conventions. Character escaping is a technique by which
non-printable or other non-ASCII (and even some ASCII) characters can be used
in a URL. This is done by placing the hexadecimal value of the character in the
URL, preceded by a percent sign '%'. Since hex values are made up of ASCII
characters, any character can be escaped to ensure only printable ASCII characters
are present in the URL. Some systems take this concept to the extreme and escape
all sorts of stuff, even characters that don't need to be escaped. To deal with
this, The Webalizer will un-escape URLs and referrers before they are processed.
For example, the URL "/www.mrunix.net/%7Ebrad/resume.html" is the same URL as
"/www.mrunix.net/~brad/resume.html", a very common form of a URL to access users'
web pages. If the URLs were not un-escaped, they would be treated as two separate
documents, even though they are really one and the same.
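Python's standard library happens to provide the same un-escaping operation, so
the effect is easy to demonstrate. This sketch uses urllib.parse.unquote (which
by default decodes escaped bytes as UTF-8) and a Counter as a stand-in for the
per-URL totals; it is an illustration, not The Webalizer's implementation.

```python
from collections import Counter
from urllib.parse import unquote

requested = [
    "/www.mrunix.net/%7Ebrad/resume.html",
    "/www.mrunix.net/~brad/resume.html",
]

# Without un-escaping, the two spellings are tallied as different documents.
raw_totals = Counter(requested)

# After un-escaping, both requests land on the same key.
clean_totals = Counter(unquote(url) for url in requested)
```

raw_totals ends up with two entries of one hit each, while clean_totals has a
single entry, "/www.mrunix.net/~brad/resume.html", with two hits.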
Search String Analysis
The Webalizer will do a
minimal analysis on referrer strings that it finds, looking for well-known search
string patterns. Most of the major search engines are supported, such as Yahoo,
AltaVista, Lycos, etc... Unfortunately, search engines are always changing their
internal/CGI query formats, new search engines are coming on line every day,
and detecting _all_ search strings is nearly impossible. However,
it should be accurate enough to give a good indication of what users were searching
for when they stumbled across your site.
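A minimal sketch of this kind of analysis, in Python, might look like the
following. The table of host fragments and query parameter names is a
hypothetical example, not the list The Webalizer actually uses, and the function
name search_terms is made up. Note that the analysis has to look at the referrer
while its query information is still attached.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical table: fragment of the engine's hostname -> query parameter
# assumed to carry the search string.
SEARCH_ENGINES = {
    "yahoo.": "p",
    "altavista.": "q",
    "lycos.": "query",
}


def search_terms(referrer):
    """Return the search string found in a referrer, or None if none is found."""
    parts = urlsplit(referrer)
    host = parts.netloc.lower()
    for fragment, parameter in SEARCH_ENGINES.items():
        if fragment in host:
            values = parse_qs(parts.query).get(parameter)
            if values:
                return values[0]    # parse_qs has already un-escaped it
    return None


example = "http://search.yahoo.com/search?p=web+log+analysis"
```

For the example referrer this returns "web log analysis"; a referrer from a
host that is not in the table returns None, which is the "nearly impossible to
detect all of them" limitation in miniature.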
Notes on Visits/Entry/Exit Figures
The majority of data analyzed
and reported on by The Webalizer is as accurate and correct as possible based
on the input log file. However, due to the limitations of the HTTP protocol,
the use of firewalls, proxy servers, multi-user systems, the rotation of your
log files, and a myriad of other conditions, some of these numbers cannot be
calculated with absolute accuracy. In particular, Visits, Entry Pages and Exit
Pages are subject to random errors due to the above and other conditions. The
reason for this is twofold: 1) log files are finite in size and time interval,
and 2) there is no way to tell individual users apart given
only an IP address. Because log files are finite, they have a beginning and an end,
which can be represented as a fixed time period. There is no way of knowing
what happened previous to this time period, nor is it possible to predict future
events based on it. Also, because it is impossible to tell individual
users apart, multiple users that have the same IP address all appear to be a
single user, and are treated as such. This is most common where corporate users
sit behind a proxy/firewall to the outside world, and all requests appear to
come from the same location (the address of the proxy/firewall itself). Dynamic
IP assignment (used with dial-up internet accounts) also presents a problem,
since the same user will appear to come from multiple places.
For example, suppose two
users visit your server from XYZ company, which has their network connected
to the internet by a proxy server 'fw.xyz.com'. All requests from the network
look as though they originated from 'fw.xyz.com', even though they were really
initiated by two separate users on different PCs. The Webalizer would see
these requests as from the same location, and would record only 1 visit, when
in reality, there were two. Because entry and exit pages are calculated in conjunction
with visits, this situation would also only record 1 entry and 1 exit page,
when in reality, there should be 2.
As another example, say
a single user at XYZ company is surfing around your website. They arrive at
11:52pm the last day of the month, and continue surfing until 12:30am, which
is now a new day (in a new month). Since a common practice is to rotate (save
then clear) the server logs at the end of the month, you now have the user's
visit logged in two different files (current and previous months). Because of
this (and the fact that The Webalizer clears history between months), the first
page the user requests after midnight will be counted as an entry page. This
is unavoidable, since it is the first request seen from that particular IP address
in the new month.
For the most part, the numbers
shown for visits, entry and exit pages are pretty good 'guesses', even though
they may not be 100% accurate. They do provide a good indication of overall
trends, and shouldn't be far enough off from the real numbers to matter much. You
should probably consider them as the 'minimum' amount possible, since the actual
(real) values should always be equal or greater in all cases.
Final Notes
A lot of time and effort
went into making The Webalizer, and into ensuring that the results are as accurate
as possible. If you find any abnormalities or inconsistent results, bugs, errors,
omissions or anything else that doesn't look right, please let me know so I
can investigate the problem or correct the error. This goes for the minimal
documentation as well. Suggestions for future versions are also welcome and
appreciated.