Probably



The foundation of my thesis involves a product that crawls websites and monitors the ingress and egress traffic of specified pages. This traffic can be grouped, categorized, filtered, and run through rules that determine its classification. Classification can be driven by information in the URL, such as the domain, path, and query parameters, along with other enriching data found in the domain’s meta tags. This meta information isn’t always available, but the more URLs that are crawled and classified, manually or automatically, the more accurate the system can become.

Bayes’ Theorem is the foundation of this automated classification system.

Bayes’ Theorem

The foundation of Bayes’ Theorem is conditional probability. For example, the probability that event “A” will occur, given that event “B” has already occurred, is equal to the probability that events “A” and “B” both occur, divided by the probability that event “B” will occur.

\[P(A|B) = {P(A\ and\ B) \over P(B)}\]

The probability of “B” given that “A” has occurred is the probability that events “B” and “A” have occurred, divided by the probability of “A”.

\[P(B|A) = {P(B\ and\ A) \over P(A)}\]

So the probability that events “A” and “B” occur is equal to the probability that events “B” and “A” occur.

\[P(A\ and\ B) = P(B\ and\ A)\]

Then the probability of events “A” and “B” is the probability of “A” given “B” has already occurred multiplied by the probability of “B”. The same can be applied to the probability of events “B” and “A”.

\[P(A\ and\ B) = P(A|B)P(B)\] \[P(B\ and\ A) = P(B|A)P(A)\]

Resulting with the statement:

\[P(A|B)P(B) = P(B|A)P(A)\]

If we want to determine the conditional probability that event “A” will occur given that event “B” has occurred, we can rearrange the equation by dividing both sides by the probability that event “B” will occur.

\[P(A|B) = { P(B|A)P(A) \over P(B) }\]
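As a quick numeric sketch of the theorem (the probabilities below are made-up illustration values, not measurements from the crawler):

```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: 20% of crawled URLs are shopping sites (P(A)),
# 5% of all URLs contain the word "cart" (P(B)), and 15% of shopping
# URLs contain "cart" (P(B|A)).
p_shopping_given_cart = bayes(p_b_given_a=0.15, p_a=0.20, p_b=0.05)
print(p_shopping_given_cart)  # 0.6
```

Knowing how often “cart” shows up on shopping sites (the reverse conditional) lets us estimate how likely a URL containing “cart” is a shopping site.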

This is Bayes’ Theorem. If we know the reverse conditional probability, along with the individual probabilities of “A” and “B”, we can calculate the conditional probability we’re after.

It is also written using the intersection of “B” and “A”.

\[P(A|B) = { P(B \cap A) \over P(B) }\]
  • This form assumes that neither P(A) nor P(B) is 0.

Bayes’ Alternative Form

When working with two competing statements, such as classifying an email as “spam” or “ham”, we generally use Bayes’ Alternative Form. The probability that event “B” occurs needs to account for both when event “A” occurs, as well as when it does not.

Generic Probability Tree Diagram

So the alternative form of Bayes’ Theorem is the probability that “B” intersects with “A”, divided by the sum of the probability that “B” intersects with “A” and the probability that “B” intersects with “A” prime.

\[P(A|B) = { P(B \cap A) \over { P(B \cap A) + P(B \cap A') } }\]


\[P(A|B) = {P(B|A)P(A) \over { P(B|A)P(A) + P(B|A')P(A') }}\]
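A minimal sketch of the alternative form, where the denominator expands P(B) over the two branches of the probability tree (the spam/ham numbers are hypothetical):

```python
def bayes_alternative(p_b_given_a: float, p_b_given_not_a: float, p_a: float) -> float:
    """P(A|B) via the alternative form: the denominator sums over both
    branches, P(B|A)P(A) + P(B|A')P(A')."""
    p_not_a = 1.0 - p_a
    numerator = p_b_given_a * p_a
    return numerator / (numerator + p_b_given_not_a * p_not_a)

# Hypothetical numbers: 40% of email is spam, the word "free" appears
# in 30% of spam and in 2% of ham.
p_spam_given_free = bayes_alternative(p_b_given_a=0.30, p_b_given_not_a=0.02, p_a=0.40)
print(round(p_spam_given_free, 4))  # 0.9091
```

Note that we never needed P(B) directly; the tree lets us reconstruct it from the two conditional branches.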

Binary Classification: Feature Vector

Bayes’ Theorem, as written, works great when working with two competing statements. It can be further expanded by adding more classification options beyond spam and ham, as well as an extensive vocabulary.

There are two approaches to this: a binary feature representation, or an integer-based ( frequency of a word or n-gram ) bag of words. Integer-based representation makes sense in the context of parsing documents; it can give an acceptable representation of what a document contains. However, URLs are often small and have little context. Most meta keyword fields will contain a max of 10 keywords and descriptions of 150 characters. We’ll also remove “stop words” such as “a”, “to”, “be”, etc., that don’t contribute to the context. This doesn’t leave us a lot to work with, so a binary vector has the potential to deliver a simplified classifier with decent accuracy.

Using n-grams can also contribute to the overall accuracy of the classifier, as n-grams such as “ads google”, “search google”, and “shopping google” can point to a page being a shopping, advertising, or search engine site. This doesn’t change the overall classifier, just how we parse the data.
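A rough sketch of that parsing step, assuming a toy stop-word list and a hypothetical vocabulary (both would be built from the crawl data in practice):

```python
# Illustrative stop-word list; a real one would be much larger.
STOP_WORDS = {"a", "to", "be", "the", "and", "of"}

def tokens(text: str) -> list[str]:
    """Lowercase, drop stop words, and emit both unigrams and bigrams."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

def binary_vector(text: str, vocabulary: list[str]) -> list[int]:
    """b_t = 1 if vocabulary term t appears in the text, else 0."""
    present = set(tokens(text))
    return [1 if term in present else 0 for term in vocabulary]

vocab = ["shopping", "google", "shopping google", "ads"]  # hypothetical V
print(binary_vector("Shopping Google results", vocab))  # [1, 1, 1, 0]
```

Swapping unigrams for n-grams only changes what lands in the vocabulary; the binary vector and the classifier downstream stay the same.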

The Classifier

The classifier is a Bernoulli-based model represented by a binary vector. A set of words, “V”, represents our total vocabulary in the model. The variable “b” is the binary vector stating whether or not each word is found in a document “D”, “w” represents a word, and “b” sub “t” is the t-th element of the binary vector.

We’ll define “C” sub “k” as the Class “k” we’re checking the model against and “w” sub “t” as the current word we’re checking the probability of existing in Class “k.”

The probability of observing “w” sub “t” given Class “k” is the number of documents in Class “k” in which “w” sub “t” is observed, divided by the total number of documents in Class “k.”

Building off of alternate Bayes, we also want to account for the probability of a word not existing in Class “k.” This can be written as 1 minus the probability of “w” sub “t” given “C” sub “k”.

We then take the product of the resulting probabilities to get the probability that the document is classified as “C” sub “k.”

\[P(D|C_k) = P(b|C_k) = \prod_{t=1}^{|V|} \left[ b_t P(w_t|C_k) + (1 - b_t)(1 - P(w_t|C_k)) \right]\]


\[P(w_t|C_k) = {n_k(w_t) \over {N_k}}\]

Here, “N” sub “k” is the total number of documents for Class “k”, and “n” sub “k” of “w” sub “t” is the number of documents of Class “k” in which the word “w” sub “t” was found.
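Putting the pieces together, a minimal sketch of the classifier. The class and method names are hypothetical, add-one smoothing is an addition beyond the formulas above (to avoid zero probabilities when a word never appears in a class), and classes are compared by likelihood alone (equal priors assumed):

```python
from collections import defaultdict

class BernoulliBayes:
    """Minimal Bernoulli naive Bayes over binary document vectors."""

    def __init__(self, vocabulary: list[str]):
        self.vocabulary = vocabulary
        self.doc_counts = defaultdict(int)                        # N_k
        self.word_counts = defaultdict(lambda: defaultdict(int))  # n_k(w_t)

    def train(self, words: set[str], label: str) -> None:
        self.doc_counts[label] += 1
        for w in self.vocabulary:
            if w in words:
                self.word_counts[label][w] += 1

    def p_word(self, w: str, label: str) -> float:
        # P(w_t|C_k) = n_k(w_t) / N_k, with add-one smoothing
        # (an assumption beyond the post's formulas).
        return (self.word_counts[label][w] + 1) / (self.doc_counts[label] + 2)

    def p_doc(self, words: set[str], label: str) -> float:
        # P(D|C_k) = prod over V of b_t*P(w_t|C_k) + (1-b_t)*(1-P(w_t|C_k))
        p = 1.0
        for w in self.vocabulary:
            pw = self.p_word(w, label)
            p *= pw if w in words else (1.0 - pw)
        return p

    def classify(self, words: set[str]) -> str:
        return max(self.doc_counts, key=lambda k: self.p_doc(words, k))

clf = BernoulliBayes(["shopping", "google", "news"])
clf.train({"shopping", "google"}, "shopping")
clf.train({"news", "google"}, "news")
print(clf.classify({"shopping"}))  # shopping
```

Each training document is reduced to the set of vocabulary words it contains; at classification time, both the presence and the absence of each vocabulary word contribute to the product.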


I’ll have to experiment with various approaches to generating the vocabulary – n-grams vs. single words. I may also incorporate a weighted component for cases in which other organizations manually classify the network traffic, to improve the overall accuracy of the model.

This isn’t intended to be a definitive classification system that imposes categorization on businesses, but a recommendation.

Until next time!

This post is licensed under CC BY 4.0 by the author.