How to Parse URLs in Python: A Comprehensive Guide with Examples

Last Update: July 09, 2024

Parsing URLs (Uniform Resource Locators) is a common task in web development and data processing. URLs are the addresses that identify resources on the internet, such as web pages, images, and documents. Python provides powerful libraries and tools to easily parse and manipulate URLs, enabling developers to extract specific components and perform various operations on them.

Introduction to URL Parsing

URL parsing involves breaking down a URL into its constituent parts, such as the scheme (e.g., “https”), host (e.g., “www.example.com”), path (e.g., “/page”), query parameters (e.g., “key=value”), and more. Python provides several libraries to facilitate URL parsing, each with its own set of features and capabilities.

Using the urllib.parse Module

The urllib.parse module is part of the Python standard library. It provides functions to split a URL into its components, rebuild a URL from components, resolve relative URLs, and encode or decode query strings.

Example: Parsing a Simple URL

>>> from urllib.parse import urlparse
>>>
>>> url = "https://www.example.com/page?query=value"
>>> parsed_url = urlparse(url)

>>> print("Scheme:", parsed_url.scheme)
Scheme: https

>>> print("Netloc:", parsed_url.netloc)
Netloc: www.example.com

>>> print("Path:", parsed_url.path)
Path: /page

>>> print("Query:", parsed_url.query)
Query: query=value
>>>
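
The parsed result exposes more than the four fields printed above. The following is a minimal sketch of a few additional attributes and of rebuilding a URL with urlunparse(). The URL with a port and a fragment is a made-up example, not one taken from the text above.

```python
from urllib.parse import urlparse, urlunparse

# Hypothetical URL with an explicit port and a fragment.
url = "https://www.example.com:8443/page?query=value#section-2"
parts = urlparse(url)

print("Hostname:", parts.hostname)  # netloc without the port
print("Port:", parts.port)          # parsed as an int, or None if absent
print("Fragment:", parts.fragment)  # text after '#'

# The result is a named tuple, so _replace() returns a modified copy
# and urlunparse() turns the six components back into a string.
rebuilt = urlunparse(parts._replace(scheme="http", fragment=""))
print("Rebuilt:", rebuilt)
```

Because the result is an immutable named tuple, changing a component always goes through a copy, which keeps the original parse available for comparison.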

Example: Encode Query Parameters with `urllib.parse.quote()` vs `urllib.parse.quote_plus()`

The urllib.parse.quote() and urllib.parse.quote_plus() functions from Python’s standard library are used to percent-encode strings, making them safe to use in URLs. Here’s how they differ:

urllib.parse.quote()

This function percent-encodes a string.
It replaces special characters in the string with %xx escapes. For example, a space is replaced with %20.
By default it leaves / unescaped (the safe parameter defaults to '/'), which makes it a good fit for encoding the path portion of a URL, the part before the ?.

urllib.parse.quote_plus()

Similar to quote(), but it replaces spaces with + signs, whereas quote() replaces them with %20.
It also replaces / with %2F, which quote() leaves untouched by default.
This behavior is useful when encoding form data in the query component of a URL, where spaces are conventionally encoded as +, as in HTML form submissions.

urllib.parse.quote() vs urllib.parse.quote_plus() examples:

>>> import urllib.parse

>>> urllib.parse.quote("https://example.com/path?foo=bar+abc&a=ABC%20XYZ 123")
'https%3A//example.com/path%3Ffoo%3Dbar%2Babc%26a%3DABC%2520XYZ%20123'

# Passing `safe=''` makes quote() replace `/` with `%2F` as well
>>> urllib.parse.quote("https://example.com/path?foo=bar+abc&a=ABC%20XYZ 123", safe='')
'https%3A%2F%2Fexample.com%2Fpath%3Ffoo%3Dbar%2Babc%26a%3DABC%2520XYZ%20123'

>>> urllib.parse.quote_plus("https://example.com/path?foo=bar+abc&a=ABC%20XYZ 123")
'https%3A%2F%2Fexample.com%2Fpath%3Ffoo%3Dbar%2Babc%26a%3DABC%2520XYZ+123'

>>> url = "https://example.com/path?foo=bar+abc&a=ABC%20XYZ 123"
>>> url = f"https://example.com?url={urllib.parse.quote_plus(url)}"
>>> url
'https://example.com?url=https%3A%2F%2Fexample.com%2Fpath%3Ffoo%3Dbar%2Babc%26a%3DABC%2520XYZ+123'
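
When there are several parameters, encoding each value by hand gets tedious. The sketch below uses urllib.parse.urlencode(), which applies quote_plus() to every key and value by default. The parameter names and values are illustrative assumptions.

```python
from urllib.parse import urlencode, quote

# Hypothetical parameters; the list value becomes a repeated key.
params = {"q": "python url parsing", "page": 2, "tag": ["a/b", "c&d"]}

# Default behaviour: values are encoded with quote_plus(), so spaces become '+'.
query = urlencode(params, doseq=True)
print(query)

# quote_via=quote switches to %20 for spaces instead of '+'.
query_pct = urlencode(params, doseq=True, quote_via=quote)
print(query_pct)

# Attach the encoded query to a base URL.
full_url = f"https://example.com/search?{query}"
print(full_url)
```

Without doseq=True, a list value would be encoded as its string representation rather than as one key=value pair per item.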

Example: Extracting Query Parameters

>>> from urllib.parse import parse_qs
>>>
>>> query_string = "key1=value1&key2=value2&key3=value3"
>>> query_params = parse_qs(query_string)
>>>
>>> for key, values in query_params.items():
...     print(key, ":", values)
...
key1 : ['value1']
key2 : ['value2']
key3 : ['value3']
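
Each value above is a list because a key may appear more than once in a query string. The following sketch, which assumes a made-up search URL, shows repeated keys, blank values, and the parse_qsl() variant that returns ordered pairs instead of a dictionary.

```python
from urllib.parse import urlparse, parse_qs, parse_qsl

# Hypothetical URL with a repeated key, an encoded space, and a blank value.
url = "https://www.example.com/search?tag=python&tag=urls&q=hello+world&empty="
query = urlparse(url).query

# Repeated keys are collected into one list; '+' is decoded to a space.
params = parse_qs(query)
print(params)

# Blank values are dropped unless keep_blank_values=True is passed.
with_blanks = parse_qs(query, keep_blank_values=True)
print(with_blanks["empty"])

# parse_qsl() returns (key, value) tuples in their original order.
pairs = parse_qsl(query)
print(pairs)
```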

Advanced URL Parsing with the furl Library

The furl library is a third-party package that offers a user-friendly API for parsing and manipulating URLs, including resolving relative URLs. Install it with pip:

pip3 install furl

Example: Handling Relative URLs

>>> from furl import furl
>>>
>>> base_url = "https://www.example.com/page/"
>>> relative_url = "../otherpage"
>>> absolute_url = furl(base_url).join(relative_url)
>>>
>>> print("Absolute URL:", absolute_url.url)
Absolute URL: https://www.example.com/otherpage

Example: Combining and Resolving URLs

>>> from furl import furl
>>>
>>> url1 = "https://www.example.com/page"
>>> url2 = "otherdir/otherpage"
>>> combined_url = furl(url1).join(url2)
>>>
>>> print("Combined URL:", combined_url.url)
Combined URL: https://www.example.com/otherdir/otherpage
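
For comparison, the standard library can resolve relative URLs without an extra dependency. This sketch uses urllib.parse.urljoin() on the same inputs as the furl examples above. The assumption that furl's join() follows the same resolution rules is based only on the matching outputs shown here.

```python
from urllib.parse import urljoin

# Same inputs as the furl examples above.
resolved = urljoin("https://www.example.com/page/", "../otherpage")
print(resolved)

# Without a trailing slash, the last path segment ("page") is treated
# as a file name and replaced by the relative path.
combined = urljoin("https://www.example.com/page", "otherdir/otherpage")
print(combined)

# With a trailing slash, "page/" is treated as a directory and kept.
nested = urljoin("https://www.example.com/page/", "otherdir/otherpage")
print(nested)
```

The trailing slash on the base URL decides whether the relative path replaces the last segment or is appended beneath it, which explains why "page" disappears from the combined URL above.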