A method to capture just the root domain, including the subdomain, from a URL in Python

There are a few ways to extract the root domain from a URL in Python. Here are two common methods:

Using urlparse

main.py
from urllib.parse import urlparse

def get_root_domain(url):
    # hostname drops any port or credentials that netloc would include
    host = urlparse(url).hostname or ''
    # Keep the last two labels, e.g. 'www.example.com' -> 'example.com'
    return '.'.join(host.split('.')[-2:])

The urlparse function from Python's built-in urllib.parse module parses the URL into its components, including the network location, which holds the domain name. We split the domain name on the period (.) and then join the last two parts (the root domain and the TLD) with a period to get the full root domain.
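If the subdomain should be kept, as the title suggests, the parsed hostname can be returned without trimming it. The following is a minimal sketch, not a definitive implementation; the function name get_host is hypothetical, and prepending // to scheme-less input is an assumption about how such input should be treated.

```python
from urllib.parse import urlparse

def get_host(url):
    """Return the full hostname (subdomains included), without port or credentials."""
    # urlparse only fills in the hostname when the URL has a scheme or starts with '//'
    if "://" not in url:
        url = "//" + url
    return urlparse(url).hostname

print(get_host("https://blog.example.com:8080/path"))  # blog.example.com
print(get_host("blog.example.com/path"))               # blog.example.com
```

Using hostname rather than netloc means a port such as :8080 never ends up in the result.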

Using regex

main.py
import re

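def get_root_domain(url):
    # Match the run of hostname characters that follows '://'
    pattern = r"(?<=://)([a-zA-Z0-9\-\.]+)"
    match = re.search(pattern, url)
    if match is None:
        return None  # no '://' found, so there is nothing to extract
    return '.'.join(match.group().split('.')[-2:])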

This method uses a regular expression to extract the domain name from the URL. The pattern matches the run of letters, digits, hyphens, and periods that comes directly after :// (indicating the start of the domain name), so it stops at the first character outside that set, such as a forward slash (/) or the colon before a port. We then split the domain name as before and return the full root domain.
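To see what the pattern actually captures, it helps to run it against a few URL shapes. The helper name match_host and the sample URLs below are illustrative assumptions; the pattern itself is the one used above.

```python
import re

PATTERN = r"(?<=://)([a-zA-Z0-9\-\.]+)"

def match_host(url):
    """Return what the pattern captures, or None when nothing matches."""
    match = re.search(PATTERN, url)
    return match.group() if match else None

print(match_host("https://www.example.com/path"))  # www.example.com
print(match_host("http://example.com:8080/"))      # example.com
print(match_host("https://user@example.com/"))     # user (credentials, not the host)
print(match_host("www.example.com"))               # None (no '://' present)
```

The last two cases are where the regex approach differs from urlparse: it mistakes credentials for the host, and it finds nothing when the scheme is missing.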

Note: both methods keep only the last two labels of the hostname. All subdomains are dropped, and domains under multi-part suffixes such as .co.uk are truncated incorrectly (www.bbc.co.uk becomes co.uk).
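One way to handle multi-part suffixes is to check the last two labels against a set of known suffixes before deciding how many labels to keep. The sketch below assumes a tiny hand-written set, MULTI_LABEL_SUFFIXES, which is purely illustrative; a real solution would rely on the full Public Suffix List, for example through a third-party package such as tldextract. The function name get_registrable_domain is also hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny sample; the real Public Suffix List has thousands of entries.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp"}

def get_registrable_domain(url):
    host = urlparse(url).hostname
    if host is None:
        return None  # no scheme or no host in the input
    labels = host.split(".")
    # Keep three labels when the last two form a known multi-label suffix, else two.
    keep = 3 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])

print(get_registrable_domain("https://www.bbc.co.uk/news"))  # bbc.co.uk
print(get_registrable_domain("https://blog.example.com/"))   # example.com
```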
