Phishing URL Detection with ML (2023)

Phishing URL Detection with ML (1)

Phishing is a form of fraud in which the attacker tries to learn sensitive information such as login credentials or account information by sending as a reputable entity or person in email or other communication channels.

Typically a victim receives a message that appears to have been sent by a known contact or organization. The message contains malicious software targeting the user’s computer or has links to direct victims to malicious websites in order to trick them into divulging personal and financial information, such as passwords, account IDs or credit card details.

Phishing URL Detection with ML (2)

Phishing is popular among attackers, since it is easier to trick someone into clicking a malicious link which seems legitimate than trying to break through a computer’s defense systems. The malicious links within the body of the message are designed to make it appear that they go to the spoofed organization using that organization’s logos and other legitimate contents.

In this article I explain: phishing domain (or Fraudulent Domain) characteristics, the features that distinguish them from legitimate domains, why it is important to detect these domains, and how they can be detected using machine learning and natural language processing techniques.

Phishing URL Detection with ML (3)

Phishing URL Detection with ML (4)

(Video) Phishing detection using machine learning technique

Many users unwittingly click phishing domains every day and every hour. The attackers are targeting both the users and the companies. According to the 3rd Microsoft Computing Safer Index Report, released in February 2014, the annual worldwide impact of phishing could be very high as $5 billion.

What is the reason of this cost?

The main reason is the lack of awareness of users. But security defenders must take precautions to prevent users from confronting these harmful sites. Preventing these huge costs can start with making people conscious in addition to building strong security mechanisms which are able to detect and prevent phishing domains from reaching the user.

Lets check the URL structure for the clear understanding of how attackers think when they create a phishing domain.

Uniform Resource Locator (URL) is created to address web pages. The figure below shows relevant parts in the structure of a typical URL.

Phishing URL Detection with ML (5)

It begins with a protocol used to access the page. The fully qualified domain name identifies the server who hosts the web page. It consists of a registered domain name (second-level domain) and suffix which we refer to as top-level domain (TLD). The domain name portion is constrained since it has to be registered with a domain name Registrar. A Host name consists of a subdomain name and a domain name. An phisher has full control over the subdomain portions and can set any value to it. The URL may also have a path and file components which, too, can be changed by the phisher at will. The subdomain name and path are fully controllable by the phisher. We use the term FreeURL to refer to those parts of the URL in the rest of the article.

The attacker can register any domain name that has not been registered before. This part of URL can be set only once. The phisher can change FreeURL at any time to create a new URL. The reason security defenders struggle to detect phishing domains is because of the unique part of the website domain (the FreeURL). When a domain detected as a fraudulent, it is easy to prevent this domain before an user access to it.

Some threat intelligence companies detect and publish fraudulent web pages or IPs as blacklists, thus preventing these harmful assets by others is getting easier. (cymon, firehol)

The attacker must intelligently choose the domain names because the aim should be convincing the users,and then setting the FreeURL to make detection difficult. Lets analyse an example given below.

Phishing URL Detection with ML (6)

Although the real domain name is active-userid.com, the attacker tried to make the domain look like paypal.com by adding FreeURL. When users see paypal.com at the beginning of the URL, they can trust the site and connect it, then can share their sensitive information to the this fraudulent site. This is a frequently used method by attackers.

(Video) Phishing URL Detection Presentation

Other methods that are often used by attackers are Cybersquatting and Typosquatting.

Cybersquatting (also known as domain squatting), is registering, trafficking in, or using a domain name with bad faith intent to profit from the goodwill of a trademark belonging to someone else. The cybersquatter may offer selling the domain to a person or company who owns a trademark contained within the name at an inflated price or may use it for fraudulent purposes such as phishing. For example, the name of your company is “abcompany” and you register as abcompany.com. Then phishers can register abcompany.net, abcompany.org, abcompany.biz and they can use it for fraudulent purpose.

Typosquatting, also called URL hijacking, is a form of cybersquatting which relies on mistakes such as typographical errors made by Internet users when inputting a website address into a web browser or based on typographical errors that are hard to notice while quick reading. URLs which are created with Typosquatting looks like a trusted domain. A user may accidentally enter an incorrect website address or click a link which looks like a trusted domain, and in this way, they may visit an alternative website owned by a phisher.

A famous example of Typosquatting is goggle.com, an extremely dangerous website. Another similar thing is yutube.com, which is similar to goggle.com except it targets Youtube users. Similarly, www.airfrance.com has been typosquatted as www.arifrance.com, diverting users to a website peddling discount travel. Some other examples; paywpal.com, microroft.com, applle.com, appie.com.

There are a lot of algorithms and a wide variety of data types for phishing detection in the academic literature and commercial products. A phishing URL and the corresponding page have several features which can be differentiated from a malicious URL. For example; an attacker can register long and confusing domain to hide the actual domain name (Cybersquatting, Typosquatting). In some cases attackers can use direct IP addresses instead of using the domain name. This type of event is out of our scope, but it can be used for the same purpose. Attackers can also use short domain names which are irrelevant to legitimate brand names and don’t have any FreeUrl addition. But these type of web sites are also out of our scope, because they are more relevant to fraudulent domains instead of phishing domains.

Beside URL-Based Features, different kinds of features which are used in machine learning algorithms in the detection process of academic studies are used. Features collected from academic studies for the phishing domain detection with machine learning techniques are grouped as given below.

  1. URL-Based Features
  2. Domain-Based Features
  3. Page-Based Features
  4. Content-Based Features

URL-Based Features

URL is the first thing to analyse a website to decide whether it is a phishing or not. As we mentioned before, URLs of phishing domains have some distinctive points. Features which are related to these points are obtained when the URL is processed. Some of URL-Based Features are given below.

  • Digit count in the URL
  • Total length of URL
  • Checking whether the URL is Typosquatted or not. (google.com → goggle.com)
  • Checking whether it includes a legitimate brand name or not (apple-icloud-login.com)
  • Number of subdomains in URL
  • Is Top Level Domain (TLD) one of the commonly used one?

Domain-Based Features

The purpose of Phishing Domain Detection is detecting phishing domain names. Therefore, passive queries related to the domain name, which we want to classify as phishing or not, provide useful information to us. Some useful Domain-Based Features are given below.

  • Its domain name or its IP address in blacklists of well-known reputation services?
  • How many days passed since the domain was registered?
  • Is the registrant name hidden?

Page-Based Features

Page-Based Features are using information about pages which are calculated reputation ranking services. Some of these features give information about how much reliable a web site is. Some of Page-Based Features are given below.

  • Global Pagerank
  • Country Pagerank
  • Position at the Alexa Top 1 Million Site

Some Page-Based Features give us information about user activity on target site. Some of these features are given below. Obtaining these types of features is not easy. There are some paid services for obtaining these types of features.

  • Estimated Number of Visits for the domain on a daily, weekly, or monthly basis
  • Average Pageviews per visit
  • Average Visit Duration
  • Web traffic share per country
  • Count of reference from Social Networks to the given domain
  • Category of the domain
  • Similar websites etc.

Content-Based Features

Obtaining these types of features requires active scan to target domain. Page contents are processed for us to detect whether target domain is used for phishing or not. Some processed information about pages are given below.

  • Page Titles
  • Meta Tags
  • Hidden Text
  • Text in the Body
  • Images etc.

By analysing these information, we can gather information such as;

  • Is it required to login to website
  • Website category
  • Information about audience profile etc.

All of features explained above are useful for phishing domain detection. In some cases, it may not be useful to use some of these, so there are some limitations for using these features. For example, it may not be logical to use some of the features such as Content-Based Features for the developing fast detection mechanism which is able to analyze the number of domains between 100.000 and 200.000. Another example would be, if we want to analyze new registered domains Page-Based Features is not very useful. Therefore, the features that will be used by the detection mechanism depends on the purpose of the detection mechanism. Which features to use in the detection mechanism should be selected carefully.

(Video) Phishing Website Detection by Machine Learning Techniques

Detecting Phishing Domains is a classification problem, so it means we need labeled data which has samples as phish domains and legitimate domains in the training phase. The dataset which will be used in the training phase is a very important point to build successful detection mechanism. We have to use samples whose classes are precisely known. So it means, the samples which are labeled as phishing must be absolutely detected as phishing. Likewise the samples which are labeled as legitimate must be absolutely detected as legitimate. Otherwise, the system will not work correctly if we use samples that we are not sure about.

For this purpose, some public datasets are created for phishing. Some of the well-known one is PhishTank. These data sources are used commonly in academic studies.

Collecting legitimate domains is another problem. For this purpose, site reputation services are commonly used. These services analyse and rank available websites. This ranking may be global or may be country-based. Ranking mechanism depends on a wide variety of features. The websites which have high rank scores are legitimate sites which are used very frequently. One of the well-known reputation ranking service is Alexa. Researchers are using top lists of Alexa for legitimate sites.

When we have raw data for phishing and legitimate sites, the next step should be processing these data and extract meaningful information from it to detect fraudulent domains. The dataset to be used for machine learning must actually consist these features. So, we must process the raw data which is collected from Alexa, Phishtank or other data resources, and create a new dataset to train our system with machine learning algorithms. The feature values should be selected according to our needs and purposes and should be calculated for every one of them.

There so many machine learning algorithms and each algorithm has its own working mechanism. In this article, we have explained Decision Tree Algorithm, because I think, this algorithm is a simple and powerful one.

Initially, as we mentioned above, phishing domain is one of the classification problem. So, this means we need labeled instances to build detection mechanism. In this problem we have two classes: (1) phishing and (2) legitimate.

When we calculate the features that we’ve selected our needs and purposes, our dataset looks like in figure below. In our examples, we selected 12 features, and we calculated them. Thus we generated a dataset which will be used in training phase of machine learning algorithm.

A Decision Tree can be considered as an improved nested-if-else structure. Each features will be checked one by one. An example tree model is given below.

Generating a tree is the main structure of detection mechanism. Yellow and elliptical shaped ones represent features and these are called nodes. Green and angular ones represent classes and these are called leaves. The length is checked when an example arrives and then the other features are checked according to the result. When the journey of the samples is completed, the class that a sample belongs to will become clear.

Phishing URL Detection with ML (8)

(Video) Phishing URL Detection Using Machine Learning

Now, the most important question about Decision Trees is not answered yet. The question is that which feature will be located as the root? and which ones must come after the root? Choosing features intelligently effects efficiency and success rate of algorithms directly.

So, how does decision tree algorithm select features?

Decision Tree uses a information gain measure which indicates how well a given feature separates the training examples according to their target classification. The name of the method is Information Gain. The mathematical equation of information gain method is given below.

Phishing URL Detection with ML (9)

High Gain score means that the feature has a high distinguishing ability. Because of this, the feature which has maximum gain score is selected as the root. Entropy is a statistical measure from information theory that characterizes (im-)purity of an arbitrary collection S of examples. The mathematical equation of Entropy is given below.

Phishing URL Detection with ML (10)

Original Entropy is a constant value, Relative Entropy is changeable. Low Relative Entropy Score means high purity, likewise high Relative Entropy Score means low purity. As we move down the tree, we want to increase the purity, because high purity on the leaf implies high success rate.

In the training phase, dataset is divided into two parts by comparing the feature values. In our example we have 14 samples. “+” sign representing phishing class, and “-” sign representing legitimate class. We divided these samples into two parts according to the length feature. Seven of them settle right, the other seven of them settle left. As shown in the figure below, right part of tree has high purity, so it means low Entropy Score (E), likewise left part of tree has low purity and high Entropy Score (E). All calculations were done according to the equations given above. Information Gain Score about the length feature is 0,151.

The Decision Tree Algorithm calculates this information for every feature and selects features with maximum Gain scores. To growth the tree, leaves are changed as a node which represents a feature. As the tree grows downwards, all leaves will have high purity. When the tree is big enough, the training process is completed.

(Video) Detecting Malicious Urls with Machine Learning In Python

The Tree created by selecting the most distinguishing features represents model structure for our detection mechanism. Creating mechanism which has high success rate depends on training dataset. For the generalization of system success, the training set must be consisted of a wide variety of samples taken from a wide variety of data sources. Otherwise, our system may working with high success rate on our dataset, but it can not work successfully on real world data.

FAQs

How do I verify a phishing URL? ›

Check the Links: URL phishing attacks are designed to trick recipients into clicking on a malicious link. Hover over the links within an email and see if they actually go where they claim. Enter suspicious links into a phishing verification tool like phishtank.com, which will tell you if they are known phishing links.

What is phishing detection using machine learning? ›

In phishing detection, an incoming URL is identified as phishing or not by analysing the different features of the URL and is classified accordingly. Different machine learning algorithms are trained on various datasets of URL features to classify a given URL as phishing or legitimate.

What is phishing URL? ›

What is URL Phishing? Cybercriminals use phishing URLs to try to obtain sensitive information for malicious use, such as usernames, passwords, or banking details. They send phishing emails to direct their victims to enter sensitive information on a fake website that looks like a legitimate website.

Which of the following tool is used to detect phishing? ›

Cofense PDR (Phishing Detection and Response) is a managed service where both AI-based tools and security professionals are leveraged in concert to identify and mitigate phishing attacks as they happen.

Is there a safe way to open a suspicious link? ›

If you don't want to interact with the suspicious webpage and instead just quickly want to see what it is, the easiest and safest way to open the link is probably by using an online screen capturing service for websites (e.g., https://www.screenshotmachine.com or https://screenshot.guru).

How can I check to see if a website is safe? ›

A secure URL should begin with “https” rather than “http.” The “s” in “https” stands for secure, which indicates that the site is using a Secure Sockets Layer (SSL) Certificate. This lets you know that all your communication and data is encrypted as it passes from your browser to the website's server.

How does Google detect phishing? ›

We use this classifier to maintain Google's phishing blacklist automatically. Our classifier analyzes millions of pages a day, examining the URL and the contents of a page to determine whether or not a page is phishing.

What is machine learning? ›

Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.

Can you spoof a URL? ›

Website spoofing is when an attacker builds a website with a URL that closely resembles, or even copies, the URL of a legitimate website that a user knows and trusts. In addition to spoofing the URL, the attacker may copy the content and style of a website, complete with images and text.

How do phishing links work? ›

A majority of phishing links are sent via email and designed to fool the recipient into downloading a virus, giving up a credit card number, providing personal information (like a Social Security number) or offer account or login information to a particular website.

What does a phishing website look like? ›

A phishing website looks similar to the original one as cybercriminals copy the theme, information, graphics, and other intricate details. It may link some of the pages (like contact us or careers) to those of the original website. It often uses the name of the original website.

How can I check a link? ›

General Link Safety Tips
  1. Scan the Link With a Link Scanner.
  2. Turn on Real-Time or Active Scanning in Anti-Malware Software.
  3. Keep Your Anti-Malware and Antivirus Software Up to Date.
  4. Consider Adding a Second-Opinion Malware Scanner.
2 May 2022

What happens if you open a spam link? ›

What Happens If You Click on a Phishing Link? Clicking on a phishing link or opening an attachment in one of these messages may install malware, like viruses, spyware or ransomware, on your device. This is all done behind the scenes, so it is undetectable to the average user.

What is an example of a phishing link? ›

For example: The user is redirected to myuniversity.edurenewal.com, a bogus page appearing exactly like the real renewal page, where both new and existing passwords are requested. The attacker, monitoring the page, hijacks the original password to gain access to secured areas on the university network.

Videos

1. Machine Learning for Security Analysts - Part 3: Malicious URL Predictor
(Netsec Explained)
2. Phishing Sites Prediction Using Machine Learning
(Tarun Tiwari)
3. Detecting Phishing Websites using Machine Learning Technique
(Nevon Projects)
4. Malicious URL Detection Using Machine Learning in Python | NLP
(Wisdom ML)
5. Phishing URL Detection Using Machine Learning.
(Python-Bug)
6. Phishing Websites Detection System using Machine Learning Techniques | IEEE Machine Learning Project
(Ieee Xpert)
Top Articles
Latest Posts
Article information

Author: Van Hayes

Last Updated: 11/17/2022

Views: 6037

Rating: 4.6 / 5 (46 voted)

Reviews: 85% of readers found this page helpful

Author information

Name: Van Hayes

Birthday: 1994-06-07

Address: 2004 Kling Rapid, New Destiny, MT 64658-2367

Phone: +512425013758

Job: National Farming Director

Hobby: Reading, Polo, Genealogy, amateur radio, Scouting, Stand-up comedy, Cryptography

Introduction: My name is Van Hayes, I am a thankful, friendly, smiling, calm, powerful, fine, enthusiastic person who loves writing and wants to share my knowledge and understanding with you.