How Do I Use BeautifulSoup to Read XML Data
Data is literally everywhere, in all kinds of documents. But not all of it is useful, hence the need to parse it to get the parts that are needed. XML documents are one of such documents that hold data. They are very similar to HTML files, as they have almost the same kind of structure. Hence, you'll need to parse them to get vital data, just as you would when working with HTML.
There are two major aspects to parsing XML files. They are:
- Finding Tags
- Extracting from Tags
You'll need to find the tag that holds the data you want, then extract that information. You'll learn how to do both when working with XML files before the end of this article.
Installation
BeautifulSoup is one of the most used libraries when it comes to web scraping with Python. Since XML files are similar to HTML files, it is also capable of parsing them. To parse XML files using BeautifulSoup though, it's best that you make use of Python's lxml parser.
You can install both libraries using the pip installation tool, through the command below:
pip install beautifulsoup4 lxml
To confirm that both libraries are successfully installed, you can activate the interactive shell and try importing both. If no error pops up, then you are ready to go with the rest of the article.
Here's an example:
$ python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20)
[MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import bs4
>>> import lxml
>>>
Before moving on, you should create an XML file from the code snippet below. It's quite simple, and should suit the use cases you'll learn about in the rest of the article. Simply copy, paste in your editor and save; a name like sample.xml should suffice.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root testAttr="testValue">
The Tree
<children>
<child name="Jack">First</child>
<child name="Rose">Second</child>
<child name="Blue Ivy">
Third
<grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>
</child>
<child name="Jane">Fourth</child>
</children>
</root>
Now, in your Python script, you'll need to read the XML file like a normal file, then pass it into BeautifulSoup. The rest of this article will make use of the bs_content variable, so it's important that you take this step.
# Import BeautifulSoup
from bs4 import BeautifulSoup as bs
content = []
# Read the XML file
with open("sample.xml", "r") as file:
    # Read each line in the file, readlines() returns a list of lines
    content = file.readlines()
# Combine the lines in the list into a string
content = "".join(content)
# Parse the string with the lxml parser
bs_content = bs(content, "lxml")
The code sample above imports BeautifulSoup, then it reads the XML file like a regular file. After that, it passes the content into the imported BeautifulSoup library as well as the parser of choice.
You'll notice that the code doesn't import lxml. It doesn't have to, as BeautifulSoup will choose the lxml parser as a result of passing "lxml" into the object.
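One caveat worth knowing: "lxml" selects lxml's HTML parser, which lowercases tag and attribute names and wraps the document in html and body tags. BeautifulSoup also accepts "xml", which uses lxml's XML parser and keeps names as written. The sketch below compares the two on a small inline string; the variable names are only for illustration, and the outputs in the rest of this article assume "lxml".

```python
from bs4 import BeautifulSoup

xml_text = '<root testAttr="testValue"><child name="Jack">First</child></root>'

# "lxml" is lxml's HTML parser: attribute names are lowercased.
html_soup = BeautifulSoup(xml_text, "lxml")
html_attrs = html_soup.find("root").attrs

# "xml" is lxml's XML parser: names keep their original case.
xml_soup = BeautifulSoup(xml_text, "xml")
xml_attrs = xml_soup.find("root").attrs

print(html_attrs)  # {'testattr': 'testValue'}
print(xml_attrs)   # {'testAttr': 'testValue'}
```

If the case of names matters in your documents, "xml" is probably the safer choice.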
Now, you can proceed with the rest of the article.
Finding Tags
One of the most important stages of parsing XML files is searching for tags. There are various ways to go about this when using BeautifulSoup, so you need to know about a handful of them to have the best tools for the appropriate situation.
You can find tags in XML documents by:
- Names
- Relationships
Finding Tags By Names
There are two BeautifulSoup methods you can use when finding tags by names. However, the use cases differ; let's take a look at them.
find
From personal experience, you'll use the find method more often than the other methods for finding tags in this article. The find method receives the name of the tag you want to get, and returns a Tag object for that tag if it finds one; else, it returns None.
Here's an example:
>>> result = bs_content.find("data")
>>> print(result)
<data>One</data>
>>> result = bs_content.find("unique")
>>> print(result)
<unique>Twins</unique>
>>> result = bs_content.find("father")
>>> print(result)
None
>>> result = bs_content.find("mother")
>>> print(result)
None
If you take a look at the example, you'll see that the find method returns a tag if it matches the name, else it returns None. However, if you take a closer look at it, you'll see it only returns a single tag.
For example, when find("data") was called, it only returned the first data tag, but didn't return the other ones.
GOTCHA: The find method will only return the first tag that matches its query.
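Since find can return None, calling .text on the result without checking will raise an AttributeError when the tag is missing. Here's a minimal sketch of one way to guard against that; tag_text is a hypothetical helper written for this example, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

xml_text = "<root><data>One</data><data>Two</data></root>"
soup = BeautifulSoup(xml_text, "lxml")


def tag_text(soup, name, default=""):
    """Return the text of the first tag called `name`, or `default` if absent."""
    tag = soup.find(name)
    return tag.text if tag is not None else default


first_data = tag_text(soup, "data")
missing = tag_text(soup, "father", default="(not found)")
print(first_data)  # One
print(missing)     # (not found)
```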
So how do you get to find other tags too? That leads us to the next method.
find_all
The find_all method is quite similar to the find method. The only difference is that it returns a list of tags that match its query. When it doesn't find any tag, it simply returns an empty list. Hence, find_all will always return a list.
Here's an example:
>>> result = bs_content.find_all("data")
>>> print(result)
[<data>One</data>, <data>Two</data>]
>>> result = bs_content.find_all("child")
>>> print(result)
[<child name="Jack">First</child>, <child name="Rose">Second</child>, <child name="Blue Ivy">
Third
<grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>
</child>, <child name="Jane">Fourth</child>]
>>> result = bs_content.find_all("father")
>>> print(result)
[]
>>> result = bs_content.find_all("mother")
>>> print(result)
[]
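Because find_all always gives back a list, you can loop over the result directly without checking for None first. The sketch below, which uses a shortened inline copy of the sample document, collects the name and text of each child tag and also shows the limit argument, which stops the search after a given number of matches.

```python
from bs4 import BeautifulSoup

xml_text = """<children>
<child name="Jack">First</child>
<child name="Rose">Second</child>
<child name="Jane">Fourth</child>
</children>"""
soup = BeautifulSoup(xml_text, "lxml")

# Collect the name attribute and text of every child tag.
pairs = [(tag["name"], tag.text) for tag in soup.find_all("child")]
print(pairs)  # [('Jack', 'First'), ('Rose', 'Second'), ('Jane', 'Fourth')]

# limit stops the search after that many matches.
first_two = soup.find_all("child", limit=2)
print(len(first_two))  # 2
```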
Now that you know how to use the find and find_all methods, you can search for tags anywhere in the XML document. However, you can make your searches more powerful.
Here's how:
Some tags may have the same name, but different attributes. For example, the child tags have a name attribute and different values. You can make specific searches based on those.
Take a look at this:
>>> result = bs_content.find("child", {"name": "Rose"})
>>> print(result)
<child name="Rose">Second</child>
>>> result = bs_content.find_all("child", {"name": "Rose"})
>>> print(result)
[<child name="Rose">Second</child>]
>>> result = bs_content.find("child", {"name": "Jack"})
>>> print(result)
<child name="Jack">First</child>
>>> result = bs_content.find_all("child", {"name": "Jack"})
>>> print(result)
[<child name="Jack">First</child>]
You'll see that there is something different about the use of the find and find_all methods here: they both have a second parameter.
When you pass in a dictionary as a second parameter, the find and find_all methods narrow their search to tags that have attributes and values that fit the provided key:value pair.
For example, despite using the find method in the first example, it returned the second child tag (instead of the first child tag), because that's the first tag that matches the query. The find_all method follows the same principle, except that it returns all the tags that match the query, not just the first.
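The dictionary can also be passed by keyword as attrs. That matters here because the first parameter of find and find_all is itself called name, so an attribute that happens to be named name has to go through the dictionary rather than a keyword argument. A short sketch on a trimmed inline document:

```python
from bs4 import BeautifulSoup

xml_text = """<children>
<child name="Jack">First</child>
<child name="Rose">Second</child>
</children>"""
soup = BeautifulSoup(xml_text, "lxml")

# Same search as the positional dictionary, written with the attrs keyword.
rose = soup.find("child", attrs={"name": "Rose"})
print(rose.text)  # Second

# True matches any tag that has the attribute, whatever its value.
named = soup.find_all("child", attrs={"name": True})
print([tag["name"] for tag in named])  # ['Jack', 'Rose']
```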
Finding Tags By Relationships
While less popular than searching by tag names, you can also search for tags by relationships. In the real sense though, it's more of navigating than searching.
There are three key relationships in XML documents:
- Parent: The tag in which the reference tag exists.
- Children: The tags that exist in the reference tag.
- Siblings: The tags that exist on the same level as the reference tag.
From the explanation above, you may infer that the reference tag is the most important factor in searching for tags by relationships. Hence, let's look for the reference tag, and continue the article.
Take a look at this:
>>> third_child = bs_content.find("child", {"name": "Blue Ivy"})
>>> print(third_child)
<child name="Blue Ivy">
Third
<grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>
</child>
From the code sample above, the reference tag for the rest of this section will be the third child tag, stored in a third_child variable. In the subsections below, you'll see how to search for tags based on their parent, sibling, and children relationship with the reference tag.
Finding Parents
To find the parent tag of a reference tag, you'll make use of the parent attribute. Doing this returns the parent tag, as well as the tags under it. This behaviour is quite understandable, since the children tags are part of the parent tag.
Here's an example:
>>> result = third_child.parent
>>> print(result)
<children>
<child name="Jack">First</child>
<child name="Rose">Second</child>
<child name="Blue Ivy">
Third
<grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>
</child>
<child name="Jane">Fourth</child>
</children>
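Printing a parent dumps its whole subtree, which gets noisy fast. If you only want to know where a tag sits, the name of each ancestor is usually enough. The sketch below uses a cut-down inline document and shows parent, the parents generator, and find_parent, which jumps to the nearest ancestor with a given tag name.

```python
from bs4 import BeautifulSoup

xml_text = """<root><children>
<child name="Blue Ivy">Third<grandchildren><data>One</data></grandchildren></child>
</children></root>"""
soup = BeautifulSoup(xml_text, "lxml")
data_tag = soup.find("data")

# .parent is the direct container; .name gives just the tag name.
print(data_tag.parent.name)  # grandchildren

# .parents walks upward one ancestor at a time.
ancestors = [tag.name for tag in data_tag.parents]
print(ancestors[:3])  # ['grandchildren', 'child', 'children']

# find_parent jumps to the nearest ancestor with the given name.
owner = data_tag.find_parent("child")
print(owner["name"])  # Blue Ivy
```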
Finding Children
To find the children tags of a reference tag, you'll make use of the children attribute. Doing this returns the children tags, as well as the sub-tags under each one of them. This behaviour is also understandable, as the children tags often have their own children tags too.
One thing you should note is that the children attribute returns the children tags as an iterator, not a list. So if you need a list of the children tags, you'll have to convert it to a list.
Here's an example:
>>> result = list(third_child.children)
>>> print(result)
['\nThird\n', <grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>, '\n']
If you take a closer look at the example above, you'll notice that some values in the list are not tags. That's something you need to watch out for.
GOTCHA: The children attribute doesn't only return the children tags, it also returns the text in the reference tag.
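If you only want the child tags and not the text nodes, you can filter on the Tag type, or ask find_all for direct children only with recursive=False. A small sketch of both approaches, using an inline copy of the third child:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

xml_text = """<child name="Blue Ivy">
Third
<grandchildren>
<data>One</data>
<data>Two</data>
</grandchildren>
</child>"""
soup = BeautifulSoup(xml_text, "lxml")
third_child = soup.find("child")

# Keep only real tags, dropping the text nodes.
child_tags = [node for node in third_child.children if isinstance(node, Tag)]
print([tag.name for tag in child_tags])  # ['grandchildren']

# find_all(True, recursive=False) matches every direct child tag.
direct = third_child.find_all(True, recursive=False)
print([tag.name for tag in direct])  # ['grandchildren']
```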
Finding Siblings
The last in this section is finding tags that are siblings to the reference tag. For every reference tag, there may be sibling tags before and after it. The previous_siblings attribute will return the sibling tags before the reference tag, and the next_siblings attribute will return the sibling tags after it.
Like the children attribute, the previous_siblings and next_siblings attributes don't return lists; they return generators. So you need to convert to a list if you need a list of siblings.
Take a look at this:
>>> previous_siblings = list(third_child.previous_siblings)
>>> print(previous_siblings)
['\n', <child name="Rose">Second</child>, '\n',
<child name="Jack">First</child>, '\n']
>>> next_siblings = list(third_child.next_siblings)
>>> print(next_siblings)
['\n', <child name="Jane">Fourth</child>, '\n']
>>> print(previous_siblings + next_siblings)
['\n', <child name="Rose">Second</child>, '\n', <child name="Jack">First</child>,
'\n', '\n', <child name="Jane">Fourth</child>, '\n']
The first example shows the previous siblings, the second shows the next siblings; then both results are combined to generate a list of all the siblings for the reference tag.
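As with children, the sibling generators include the newline text between tags. If you only care about sibling tags, the find_previous_siblings and find_next_siblings methods return tags only. Here's a sketch on a simplified inline document; note that previous siblings come back nearest first.

```python
from bs4 import BeautifulSoup

xml_text = """<children>
<child name="Jack">First</child>
<child name="Rose">Second</child>
<child name="Blue Ivy">Third</child>
<child name="Jane">Fourth</child>
</children>"""
soup = BeautifulSoup(xml_text, "lxml")
third_child = soup.find("child", {"name": "Blue Ivy"})

# Nearest sibling first, so Rose comes before Jack.
before = [tag["name"] for tag in third_child.find_previous_siblings("child")]
after = [tag["name"] for tag in third_child.find_next_siblings("child")]
print(before)  # ['Rose', 'Jack']
print(after)   # ['Jane']
```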
Extracting From Tags
When parsing XML documents, a lot of the work lies in finding the right tags. However, when you find them, you may also want to extract certain information from those tags, and that's what this section will teach you.
You'll see how to extract the following:
- Tag Attribute Values
- Tag Text
- Tag Content
Extracting Tag Attribute Values
Sometimes, you may have a reason to extract the values for attributes in a tag. In the following attribute-value pairing for example: name="Rose", you may want to extract "Rose."
To do this, you can make use of the get method, or access the attribute's name using [] like an index, just as you would when working with a dictionary.
Here's an example:
>>> result = third_child.get("name")
>>> print(result)
Blue Ivy
>>> result = third_child["name"]
>>> print(result)
Blue Ivy
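The two forms behave differently when the attribute doesn't exist, in the same way they do on a dictionary: get returns None (or a default you supply), while [] raises a KeyError. The sketch below shows both, along with the attrs dictionary that holds every attribute on the tag; the "age" attribute is made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<child name="Blue Ivy">Third</child>', "lxml")
child = soup.find("child")

print(child.get("name"))            # Blue Ivy
print(child.get("age"))             # None
print(child.get("age", "unknown"))  # unknown
print(child.attrs)                  # {'name': 'Blue Ivy'}

# Indexing a missing attribute raises KeyError, just like a dict.
missing_raises = False
try:
    child["age"]
except KeyError:
    missing_raises = True
print(missing_raises)  # True
```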
Extracting Tag Text
When you want to access the text values of a tag, you can use the text or strings attribute. Both will return the text in a tag, and even the children tags. However, the text attribute will return them as a single string, concatenated; while the strings attribute will return them as a generator which you can convert to a list.
Here's an example:
>>> result = third_child.text
>>> result
'\nThird\n\nOne\nTwo\nTwins\n\n'
>>> result = list(third_child.strings)
>>> print(result)
['\nThird\n', '\n', 'One', '\n', 'Two', '\n', 'Twins', '\n', '\n']
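Most of what comes back above is whitespace from the line breaks in the file. If you'd rather have just the words, the stripped_strings attribute and the get_text method with strip=True both drop the whitespace-only pieces. A quick sketch on an inline copy of the third child:

```python
from bs4 import BeautifulSoup

xml_text = """<child name="Blue Ivy">
Third
<grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>
</child>"""
soup = BeautifulSoup(xml_text, "lxml")
third_child = soup.find("child")

# stripped_strings yields each piece of text with surrounding whitespace removed.
words = list(third_child.stripped_strings)
print(words)  # ['Third', 'One', 'Two', 'Twins']

# get_text can join the stripped pieces with a separator of your choice.
joined = third_child.get_text(" ", strip=True)
print(joined)  # Third One Two Twins
```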
Extracting Tag Content
Aside from extracting the attribute values and tag text, you can also extract all of a tag's content. To do this, you can use the contents attribute; it is a bit similar to the children attribute and will yield the same results. However, while the children attribute returns an iterator, the contents attribute returns a list.
Here's an example:
>>> result = third_child.contents
>>> print(result)
['\nThird\n', <grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>, '\n']
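Since contents is a plain list, you can index it, slice it, and take its length right away. The sketch below uses a grandchildren tag written on one line, so there are no whitespace text nodes in between, and checks that children yields the same items.

```python
from bs4 import BeautifulSoup

xml_text = "<grandchildren><data>One</data><data>Two</data><unique>Twins</unique></grandchildren>"
soup = BeautifulSoup(xml_text, "lxml")
grandchildren = soup.find("grandchildren")

contents = grandchildren.contents
print(len(contents))      # 3
print(contents[0].text)   # One
print(contents[-1].name)  # unique

# children yields the same items, one at a time.
same_items = contents == list(grandchildren.children)
print(same_items)  # True
```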
Printing Beautiful
So far, you've seen some important methods and attributes that are useful when parsing XML documents using BeautifulSoup. But if you notice, when you print the tags to the screen, they have some kind of clustered look. While appearance may not have a direct impact on your productivity, it can help you parse more effectively and make the work less tedious.
Here's an example of printing the normal way:
>>> print(third_child)
<child name="Blue Ivy">
Third
<grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>
</child>
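However, you can improve its appearance by using the prettify method. Simply call the prettify method on the tag while printing, and you'll get something visually pleasing.
Take a look at this:
>>> print(third_child.prettify())
<child name="Blue Ivy">
 Third
 <grandchildren>
  <data>
   One
  </data>
  <data>
   Two
  </data>
  <unique>
   Twins
  </unique>
 </grandchildren>
</child>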
Conclusion
Parsing documents is an important aspect of sourcing for data. XML documents are pretty popular, and hopefully you are better equipped to take them on, and extract the data you want.
From this article, you are now able to:
- search for tags either by names, or relationships
- extract data from tags
If you feel quite lost, and are pretty new to the BeautifulSoup library, you can check out the BeautifulSoup tutorial for beginners.
Source: https://linuxhint.com/parse_xml_python_beautifulsoup/