Excerpts
Webalizer README
The Webalizer produces several reports (html) and graphics for each
month processed. In addition, a summary page is generated for the
current and previous months (up to 12), a history file is created
and, if incremental mode is used, the current month's processed data
is saved.
The exact location and names of these files can be changed using
configuration files and command line options. The files produced
(default names) are:

index.html              - Main summary page (extension may be changed)
usage.png               - Yearly graph displayed on the main index page
usage_YYYYMM.html       - Monthly summary page (extension may be changed)
usage_YYYYMM.png        - Monthly usage graph for specified month/year
daily_usage_YYYYMM.png  - Daily usage graph for specified month/year
hourly_usage_YYYYMM.png - Hourly usage graph for specified month/year
site_YYYYMM.html        - All sites listing (if enabled)
url_YYYYMM.html         - All urls listing (if enabled)
ref_YYYYMM.html         - All referrers listing (if enabled)
agent_YYYYMM.html       - All user agents listing (if enabled)
search_YYYYMM.html      - All search strings listing (if enabled)
webalizer.hist          - Previous month history (may be changed)
webalizer.current       - Incremental Data (may be changed)
site_YYYYMM.tab         - tab delimited sites file
url_YYYYMM.tab          - tab delimited urls file
ref_YYYYMM.tab          - tab delimited referrers file
agent_YYYYMM.tab        - tab delimited user agents file
user_YYYYMM.tab         - tab delimited usernames file
search_YYYYMM.tab       - tab delimited search string file

The yearly (index) report shows statistics for a 12 month period, and
links to each month. The monthly report has detailed statistics for
that month with additional links to any URLs and referrers found.
The various totals shown are explained below.
Hits
Any request made to the server which is logged is considered a 'hit'.
The requests can be for anything... html pages, graphic
images, audio
files, CGI scripts, etc... Each valid line in the
server log is
counted as a hit. This number represents the total
number of requests
that were made to the server during the specified
report period.
Files
Some requests made to the server require that the server then send
something back to the requesting client, such as an HTML page or graphic
image. When this happens, it is considered a 'file'
and the files
total is incremented. The relationship between 'hits'
and 'files' can
be thought of as 'incoming requests' and 'outgoing
responses'.
Pages
Pages are, well, pages! Generally, any HTML document,
or anything
that generates an HTML document, would be considered
a page. This
does not include the other stuff that goes into a
document, such as
graphic images, audio clips, etc... This number represents
the number
of 'pages' requested only, and does not include the
other 'stuff' that
is in the page. What actually constitutes a 'page'
can vary from
server to server. The default action is to treat
anything with the
extension '.htm', '.html' or '.cgi' as a page. A
lot of sites will
probably define other extensions, such as '.phtml',
'.php3' and '.pl'
as pages as well. Some people consider this number
as the number of
'pure' hits... I'm not sure if I totally agree with
that viewpoint.
Some other programs (and people :) refer to this
as 'Pageviews'.
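
As a rough illustration of the default rule described above, the
following sketch decides whether a request counts as a page by looking
at the extension of its path. The names is_page and PAGE_EXTENSIONS are
made up for this example and are not taken from The Webalizer itself;
the extension list is simply the set of defaults named above.

```python
from urllib.parse import urlsplit

# Default page extensions as described in the text; a site would extend
# this with its own types (for example '.phtml', '.php3' or '.pl').
PAGE_EXTENSIONS = (".htm", ".html", ".cgi")


def is_page(url, extensions=PAGE_EXTENSIONS):
    """Return True if the request path ends with one of the page extensions."""
    path = urlsplit(url).path  # ignore any '?query' part of the request
    return path.endswith(tuple(extensions))


requests = ["/index.html", "/images/logo.png", "/cgi-bin/search.cgi", "/sound.wav"]
pages = [url for url in requests if is_page(url)]
```

Here only the two requests ending in '.html' and '.cgi' are kept as
pages, while all four would still count as hits.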
Sites
Each request made to the server comes from a unique
'site', which can
be referenced by a name or ultimately, an IP address.
The 'sites'
number shows how many unique IP addresses made requests
to the server
during the reporting time period. This DOES NOT mean
the number of
unique individual users (real people) that visited,
which is impossible
to determine using just logs and the HTTP protocol
(however, this
number might be about as close as you will get).
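
The 'sites' total amounts to counting distinct client addresses. A
minimal sketch, assuming the log has already been parsed into
(address, url) pairs; the function name count_sites and the sample
addresses are invented for illustration.

```python
def count_sites(log_records):
    """Count distinct client addresses in an iterable of (address, url) pairs."""
    return len({address for address, _url in log_records})


records = [
    ("192.0.2.10", "/index.html"),
    ("192.0.2.10", "/about.html"),
    ("198.51.100.7", "/index.html"),
]
site_total = count_sites(records)
```

Three requests from two addresses give a sites total of 2, no matter
how many real people were behind each address.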
Visits
Whenever a request is made to the server from a
given IP address
(site), the amount of time since a previous request
by the address
is calculated (if any). If the time difference is
greater than a
pre-configured 'visit timeout' value (or the address has never
made a request before),
it is considered a 'new visit', and this total is
incremented (both
for the site, and the IP address). The default timeout
value is 30
minutes (can be changed), so if a user visits your
site at 1:00 in
the afternoon, and then returns at 3:00, two visits
would be registered.
Note: in the 'Top Sites' table, the visits total
should be discounted
on 'Grouped' records, and thought of as the "Minimum
number of visits"
that came from that grouping instead. Note: Visits
only occur on
PageType requests, that is, for any request whose
URL is one of the
'page' types defined with the PageType option. Due
to the limitation
of the HTTP protocol, log rotations and other factors,
this number
should not be taken as absolutely accurate, rather,
it should be
considered a pretty close "guess".
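
The visit rule described above can be sketched roughly as follows. This
is an illustration only, not The Webalizer's actual code: it assumes the
input is already restricted to page requests, sorted by time, and given
as (address, timestamp) pairs. The names count_visits and VISIT_TIMEOUT
and the sample date are made up.

```python
from datetime import datetime, timedelta

VISIT_TIMEOUT = timedelta(minutes=30)  # default timeout stated in the text


def count_visits(page_requests, timeout=VISIT_TIMEOUT):
    """Count visits in time-sorted (address, timestamp) page requests."""
    last_seen = {}  # address -> time of its most recent request
    visits = 0
    for address, when in page_requests:
        previous = last_seen.get(address)
        # A new visit starts if the address was never seen before or has
        # been quiet for longer than the timeout.
        if previous is None or when - previous > timeout:
            visits += 1
        last_seen[address] = when
    return visits


day = datetime(2024, 1, 15)  # arbitrary date for the example
page_requests = [
    ("192.0.2.10", day.replace(hour=13, minute=0)),   # arrives at 1:00
    ("192.0.2.10", day.replace(hour=13, minute=10)),  # same visit
    ("192.0.2.10", day.replace(hour=15, minute=0)),   # returns at 3:00
]
total_visits = count_visits(page_requests)
```

With the 30 minute default, the 1:00 and 3:00 arrivals are counted as
two visits, matching the example in the text.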
KBytes
The KBytes (kilobytes) value shows the amount of
data, in KB, that
was sent out by the server during the specified reporting
period. This
value is generated directly from the log file, so
it is up to the
web server to produce accurate numbers in the logs
(some web servers
do stupid things when it comes to reporting the number
of bytes). In
general, this should be a fairly accurate representation
of the amount
of outgoing traffic the server had, regardless of the web server's
reporting quirks.
Note: A kilobyte is 1024 bytes, not 1000 :)
Top Entry and Exit Pages
The Top Entry and Exit tables give a rough estimate of which URLs
are used to enter your site, and what the last pages viewed are.
Because of limitations in the HTTP protocol, log rotations, etc.,
these figures should be considered a "rough guess" of the actual
numbers; however, they will give a good indication of the overall
trend in where users come into, and exit, your site.
Notes on Referrers
Referrers are weird critters... They take many shapes and forms, which
makes them much harder to analyze than a typical URL, which
at least has some
standardization. What is contained in the referrer
field of your log
files varies depending on many factors, such as what
site did the referral,
what type of system it comes from and how the actual
referral was generated.
Why is this? Well, because a user can get to your
site in many ways... They
may have your site bookmarked in their browser, they may simply type
your site's URL into their browser, they could have
clicked on a link on some
remote web page or they may have found your site
from one of the many search
engines and site indexes found on the web. The Webalizer
attempts to deal
with all this variation in an intelligent way by
doing certain things to
the referrer string which makes it easier to analyze.
Of course, if your
web server doesn't provide referrer information,
you probably don't really
care and are asking yourself why you are reading
this section...
Most referrers will take the form of "http://somesite.com/somepage.html",
which is what you will get if the user clicks on
a link somewhere on the
web in order to get to your site. Some will be a
variation of this, and
look something like "file:/some/such/sillyname",
which is a reference from
an HTML document on the user's local machine. Several
variations of this can
be used, depending on what type of system the user
has, if he/she is on
a local network, the type of network, etc... To complicate
things even
more, dynamic HTML documents and HTML documents that
are generated by
CGI scripts or external programs produce lots of
extra information which
is tacked on to the end of the referrer string in
an almost infinite number
of ways. If the user just typed your URL into their browser or clicked
on a bookmark, there won't be any information in the referrer field,
which will take the form "-".
In order to handle all these variations, The Webalizer
parses the referrer
field in a certain way. First, if the referrer string
begins with "http",
it assumes it is a normal referral and converts the "http://" and
following
hostname to lowercase in order to simplify hiding
if desired. For example,
the referrer "HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html" will
become "http://www.myhost.com/This/Is/A/HTML/Document.html". Notice
that only the "http://" and hostname are converted to lower case...
The rest of the
referrer field is left alone. This follows standard
convention, as the
actual method (HTTP) and hostname are always case
insensitive, while the
document name portion is case sensitive.
Referrers that came from search engines, dynamic
HTML documents, CGI
scripts and other external programs usually tack on additional
information that was used to create the page. A common example of this
can be found in referrals that come from search engines and site
indexes common on the web. Sometimes, these referrer URLs can be several
hundred characters
long and include all the information that the user
typed in to search for
your site. The Webalizer deals with this type of
referrer by stripping
off all the query information, which starts with
a question mark '?'.
The Referrer "http://search.yahoo.com/search?p=usa%26global%26link" will
be converted to just "http://search.yahoo.com/search".
When a user comes to your site by using one of their
bookmarks or by
typing in your URL directly into their browser, the
referrer field is
blank, and looks like "-". Most sites will
get more of these referrals
than any other type. The Webalizer converts this
type of referral into
the string "- (Direct Request)". This is
done in order to make it easier
to hide via a command line option or configuration
file option. This is
because the character "-" is a valid character
elsewhere in a referrer
field, and if not turned into something unique, could
not be hidden without
possibly hiding other referrers that shouldn't be.
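
The three referrer conversions described in this section can be
sketched together as below. This is a loose illustration under stated
assumptions, not The Webalizer's implementation: the function name
normalize_referrer is invented, and the order in which the steps are
applied here is a guess.

```python
DIRECT_REQUEST = "- (Direct Request)"  # label given in the text


def normalize_referrer(referrer):
    """Apply the referrer conversions described in the text to one string."""
    # A blank referrer is logged as "-"; turn it into a unique label.
    if referrer == "-":
        return DIRECT_REQUEST
    # Lowercase only the "http://" part and the hostname.
    if referrer[:4].lower() == "http":
        scheme, sep, rest = referrer.partition("://")
        if sep:
            host, slash, path = rest.partition("/")
            referrer = scheme.lower() + sep + host.lower() + slash + path
    # Strip the query information, which starts at the first '?'.
    return referrer.split("?", 1)[0]


cleaned = [
    normalize_referrer("HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html"),
    normalize_referrer("http://search.yahoo.com/search?p=usa%26global%26link"),
    normalize_referrer("-"),
]
```

The three sample inputs are the ones used in the text, and the results
match the converted forms given there. Referrers that do not begin with
"http", such as the "file:" form, keep their case.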
Notes on Character Escaping
The HTTP protocol defines certain ways that URLs
can look and behave. To
some extent, referrer fields follow most of the same
conventions. Character
escaping is a technique by which non-printable or
other non-ASCII (and even
some ASCII) characters can be used in a URL. This
is done by placing the
hexadecimal value of the character in the URL, preceded
by a percent sign '%'.
Since Hex values are made up of ASCII characters,
any character can be
escaped to ensure only printable ASCII characters
are present in the URL.
Some systems take this concept to the extreme and
escape all sorts of stuff,
even characters that don't need to be escaped. To
deal with this, The
Webalizer will un-escape URLs and referrers before they are processed.
For example, the URL "/www.mrunix.net/%7Ebrad/resume.html" is the same
URL as "/www.mrunix.net/~brad/resume.html", a very common form of URL
used to access users' web pages. If the URLs were not un-escaped,
they would be treated as
two separate documents, even though they are really
one and the same.
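
The effect of un-escaping can be shown with a short sketch. It uses
Python's standard urllib.parse.unquote as a stand-in for whatever
routine The Webalizer uses internally, so the details of the two may
differ; the variable names are made up.

```python
from collections import Counter
from urllib.parse import unquote

escaped = "/www.mrunix.net/%7Ebrad/resume.html"
plain = "/www.mrunix.net/~brad/resume.html"

# Tally hits per URL with and without un-escaping first.
logged_urls = [escaped, plain]
raw_counts = Counter(logged_urls)
unescaped_counts = Counter(unquote(url) for url in logged_urls)
```

Without un-escaping, the two requests are tallied as two different
documents with one hit each; after un-escaping, they collapse into a
single entry with two hits.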
Search String Analysis
The Webalizer will do a minimal analysis on referrer
strings that
it finds, looking for well known search string patterns.
Most of
the major search engines are supported, such as Yahoo!,
Altavista,
Lycos, etc... Unfortunately, search engines are always
changing
their internal/CGI query formats, new search engines
are coming on
line every day, and the ability to detect _all_ search
strings is
nearly impossible. However, it should be accurate
enough to give
a good indication of what users were searching for
when they stumbled
across your site. Note: as of version 1.31, search
engines can now
be specified within a configuration file. See the
sample.conf file
for examples of how to specify additional search
engines.
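
One way to picture the search string analysis is a table that maps a
part of the referrer's hostname to the query parameter that carries the
search terms. The sketch below is hypothetical: the engine names come
from the text above, but the parameter names, the SEARCH_ENGINES table
and the function extract_search_string are assumptions made for
illustration, and they say nothing about what sample.conf contains.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical table: hostname fragment -> query parameter with the terms.
SEARCH_ENGINES = {
    "yahoo.com": "p",
    "altavista.com": "q",
    "lycos.com": "query",
}


def extract_search_string(referrer, engines=SEARCH_ENGINES):
    """Return the search terms found in a referrer, or None if none match."""
    parts = urlsplit(referrer)
    host = (parts.hostname or "").lower()
    for fragment, param in engines.items():
        if fragment in host:
            # parse_qs un-escapes values and turns '+' into a space.
            values = parse_qs(parts.query).get(param)
            if values:
                return values[0]
    return None


terms = extract_search_string("http://search.yahoo.com/search?p=web+log+analysis")
no_match = extract_search_string("http://www.example.com/page.html?p=foo")
```

A referrer from an engine missing from the table yields nothing, which
is why detection of all search strings can never be complete.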
Notes on Visits/Entry/Exit Figures
The majority of data analyzed and reported on by
The Webalizer is
as accurate and correct as possible based on the
input log file.
However, due to the limitation of the HTTP protocol,
the use of
firewalls, proxy servers, multi-user systems, the
rotation of your
log files, and a myriad of other conditions, some of these numbers
cannot be calculated with absolute accuracy. In particular, Visits,
Entry Pages and Exit Pages are subject to random errors due to the
above and other conditions. The reason
for this is
twofold, 1) Log files are finite in size and time
interval, and
2) There is no way to distinguish multiple individual
users apart
given only an IP address. Because log files are finite,
they have
a beginning and ending, which can be represented
as a fixed time
period. There is no way of knowing what happened
previous to this
time period, nor is it possible to predict future
events based on
it. Also, because it is impossible to distinguish
individual users
apart, multiple users that have the same IP address
all appear to
be a single user, and are treated as such. This is
most common where
corporate users sit behind a proxy/firewall to the
outside world,
and all requests appear to come from the same location
(the address
of the proxy/firewall itself). Dynamic IP assignment (used with
dial-up internet accounts) also presents a problem, since the same
user will appear to come from multiple places.
For example, suppose two users visit your server
from XYZ company,
which has their network connected to the Internet
by a proxy server
'fw.xyz.com'. All requests from the network look
as though they
originated from 'fw.xyz.com', even though they were
really initiated
from two separate users on different PC's. The Webalizer
would
see these requests as from the same location, and
would record only
1 visit, when in reality, there were two. Because
entry and exit
pages are calculated in conjunction with visits,
this situation
would also only record 1 entry and 1 exit page, when
in reality,
there should be 2.
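
The proxy example can be reproduced with a small sketch that tallies
visits together with entry and exit pages. It is an illustration under
assumptions, not The Webalizer's code: requests are taken to be page
requests sorted by time, an entry page is the first page of a visit and
an exit page the last one, and the name tally_visits, the URLs and the
times are all invented.

```python
from collections import Counter
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # default visit timeout


def tally_visits(page_requests, timeout=TIMEOUT):
    """Return (visits, entry_pages, exit_pages) for time-sorted requests.

    Each request is an (address, timestamp, url) tuple.
    """
    last = {}  # address -> (timestamp, url) of its most recent request
    visits = 0
    entries, exits = Counter(), Counter()
    for address, when, url in page_requests:
        previous = last.get(address)
        if previous is None or when - previous[0] > timeout:
            visits += 1
            entries[url] += 1
            if previous is not None:
                exits[previous[1]] += 1  # the earlier visit ended there
        last[address] = (when, url)
    for _when, url in last.values():
        exits[url] += 1  # close visits still open at the end of the log
    return visits, entries, exits


start = datetime(2024, 1, 15, 9, 0)  # arbitrary time for the example
# Two different people, both logged as the proxy 'fw.xyz.com'.
page_requests = [
    ("fw.xyz.com", start, "/index.html"),                            # user A
    ("fw.xyz.com", start + timedelta(minutes=2), "/products.html"),  # user B
    ("fw.xyz.com", start + timedelta(minutes=5), "/contact.html"),   # user A
]
visits, entries, exits = tally_visits(page_requests)
```

Because both users share one address, the tally shows a single visit
with one entry page and one exit page, even though two people each
entered and left the site.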
As another example, say a single user at XYZ company
is surfing
around your website. They arrive at 11:52pm on the
last day of
the month, and continue surfing until 12:30am, which
is now a
new day (in a new month). Since a common practice
is to rotate
(save then clear) the server logs at the end of the
month, you
now have the user's visit logged in two different
files (current
and previous months). Because of this (and the fact
that the
Webalizer clears history between months), the first
page the
user requests after midnight will be counted as an
entry page.
This is unavoidable, since it is the first request seen from that
particular IP address in the new month.
For the most part, the numbers shown for visits,
entry and exit
pages are pretty good 'guesses', even though they
may not be 100%
accurate. They do provide a good indication of overall
trends,
and shouldn't be far enough off from the real numbers to matter much.
You should probably consider them as the 'minimum' amount possible,
since the actual (real) values should always be equal to or greater
in all cases.
The above are excerpts from the README file which accompanies
the Webalizer software. The entire README file can be found at
www.webalizer.org