Regular Expression: A simple way for string manipulation
Data on the web usually arrives in mixed form, with HTML, CSS, and script all in the same string. Finding the meaningful data inside that mix is a complex operation. To make this easier, we can use regular expressions.
What is a Regular Expression?
A nice definition from Wikipedia is,
"A regular expression, regex or regexp (sometimes called a rational expression) is a sequence of characters that define a search pattern."
Why do we need a "Regular Expression"?
Regular expressions are generally used in string manipulation and string searching algorithms; search engines are a well-known example. If our string follows some regular pattern, then a regex is a natural solution.
How to Construct A Regular Expression
Constructing one is easy, but it requires a little knowledge of patterns and of how to write them so that our tool or programming language can understand them. If you are new to regular expressions, the tutorials on w3schools.com are a good place to learn the basics.
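To make this concrete, here is a minimal sketch in C# (the language used in the rest of this article) of how a pattern is assembled from small pieces. The date pattern, the PatternDemo class, and the sample inputs are my own illustration and are only assumptions for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // ^ and $ anchor the pattern to the start and end of the input,
    // \d{4} means exactly four digits, and - is a literal hyphen.
    public const string DatePattern = @"^\d{4}-\d{2}-\d{2}$";

    // Returns true when the input has the shape yyyy-mm-dd.
    // It checks the shape only, not whether the date is a real calendar date.
    public static bool LooksLikeDate(string input)
    {
        return Regex.IsMatch(input, DatePattern);
    }

    public static void Main()
    {
        Console.WriteLine(LooksLikeDate("2019-03-15")); // True
        Console.WriteLine(LooksLikeDate("15/03/2019")); // False
    }
}
```

Reading the pattern left to right is usually the easiest way to understand it: anchor, four digits, hyphen, two digits, hyphen, two digits, anchor.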
Regular expression in C#
Now that you know the basics of regular expressions, let us look at C#, which provides a dedicated Regex class in the System.Text.RegularExpressions namespace.
- While working with regular expressions in C#, it's good to use verbatim strings (@"...") instead of regular strings, so that backslashes in the pattern do not have to be escaped.
- A recommended alternative to repeatedly instantiating a regular expression is a static regular expression, that is, a call to one of the static Regex methods.
- By default, only the 15 most recently used static regular expressions are kept in the cache.
- If our application uses more than 15 static regular expressions, then some of the regular expressions must be recompiled. To prevent this recompilation, we can increase the Regex.CacheSize property.
- If the regular expression engine cannot identify a match within the time-out interval, then the matching operation throws a RegexMatchTimeoutException exception.
- We can set a time-out interval by calling the Regex(String, RegexOptions, TimeSpan) constructor when we instantiate a regular expression object. For more we can check https://docs.microsoft.com/en-us/dotnet/standard/base-types/best-practices?view=netframework-4.7.2
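The points in this list can be sketched in a few lines. This is only an illustration: the RegexPractices class, the word pattern, the cache size of 30, and the one-second time-out are assumptions chosen for the example, not recommended values.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPractices
{
    // Static Regex method with a verbatim string: \w is written once, not as "\\w".
    // Patterns used through static methods are kept in the Regex cache.
    public static bool IsWord(string input)
    {
        return Regex.IsMatch(input, @"^\w+$");
    }

    // Instance with a time-out: if matching takes longer than the interval,
    // the engine throws RegexMatchTimeoutException, which we treat as "no match" here.
    public static bool IsWordWithTimeout(string input, TimeSpan timeout)
    {
        var regex = new Regex(@"^\w+$", RegexOptions.None, timeout);
        try
        {
            return regex.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Raise the cache size if the application uses many static patterns.
        Regex.CacheSize = 30;

        Console.WriteLine(IsWord("hello_123"));                                         // True
        Console.WriteLine(IsWordWithTimeout("hello world", TimeSpan.FromSeconds(1)));   // False
    }
}
```

Whether a timed-out match should count as "no match" or be reported as an error depends on the application; returning false is just the simplest choice for a sketch.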
For more details of regular expressions in C#, please visit the link https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions?redirectedfrom=MSDN&view=netframework-4.7.2
Note: If our goal is only to validate that a string conforms to a particular pattern, then we can use the "System.Configuration.RegexStringValidator" class.
Some points to know before we start using Regular expression are as follows,
RegexOptions enum values
- RegexOptions.Compiled
  - Compiles the pattern to improve matching speed at the cost of a longer start-up time; it pays off when the same pattern is matched against many input strings
- RegexOptions.IgnoreCase
  - Makes the pattern case-insensitive
- RegexOptions.Multiline
  - Makes the start and end metacharacters (^ and $ respectively) match at the beginning and end of each line of an input string that contains newlines (\n), not only at the beginning and end of the whole string
- RegexOptions.RightToLeft
  - Makes the search run from right to left instead of from left to right; it is about the search direction, not specifically about RTL languages
- RegexOptions.Singleline
  - Makes the (.) metacharacter match every character, including the newline (\n)
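The effect of these options is easiest to see side by side. The following sketch uses small inputs of my own choosing; the OptionsDemo class and its helper method are hypothetical names for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionsDemo
{
    // Counts how many times a word appears at a position where ^ can match.
    public static int CountLineStarts(string text, RegexOptions options)
    {
        return Regex.Matches(text, @"^\w+", options).Count;
    }

    public static void Main()
    {
        string text = "one\ntwo\nthree";

        // Without Multiline, ^ matches only at the start of the whole string.
        Console.WriteLine(CountLineStarts(text, RegexOptions.None));      // 1
        // With Multiline, ^ also matches after every newline.
        Console.WriteLine(CountLineStarts(text, RegexOptions.Multiline)); // 3

        // Without Singleline, the dot does not match \n.
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b"));                          // False
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b", RegexOptions.Singleline)); // True

        // IgnoreCase makes letters match regardless of case.
        Console.WriteLine(Regex.IsMatch("HTML", "html", RegexOptions.IgnoreCase)); // True

        // RightToLeft changes the search direction, so the first match found is the last one in the text.
        Console.WriteLine(Regex.Match("a1 b2 c3", @"\w\d").Value);                           // a1
        Console.WriteLine(Regex.Match("a1 b2 c3", @"\w\d", RegexOptions.RightToLeft).Value); // c3
    }
}
```

Options can be combined with the | operator, as the stripping examples later in this article do with Singleline and IgnoreCase.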
How to Remove CSS, Script and HTML from string
In this section, I show how we can put regular expressions to practical use. It should be especially helpful if you are working with web data.
/// <summary>
/// Uses two steps: one for script and CSS, and one for HTML
/// </summary>
/// <param name="input">string with HTML tags</param>
/// <returns>string of plain text</returns>
public string StripAllHTMLCssJavaScript(string input)
{
    // Step 1: remove <script> and <style> blocks together with their content
    Regex reScriptCss = new Regex(
        "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);
    // Step 2: remove every remaining tag
    Regex reStripHTML = new Regex(
        "<.*?>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    input = reScriptCss.Replace(input, string.Empty);
    input = reStripHTML.Replace(input, string.Empty);
    return input;
}
We can do the same thing a little differently by combining the two regular expressions into one expression.
/// <summary>
/// This method uses multiple conditions to search and replace "Script, Style and HTML tags"
/// </summary>
/// <param name="input">string with HTML tags</param>
/// <returns>string of plain text</returns>
public string StripAllHTMLCssJavaScriptLittleBetter(string input)
{
    Regex reScriptCssHTML = new Regex(
        "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>|(<.*?>))",
        RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);
    input = reScriptCssHTML.Replace(input, string.Empty);
    return input;
}
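To see what the combined expression does to a concrete input, here is a small self-contained sketch. The StripDemo class and the sample HTML are hypothetical; the pattern has the same three alternatives as the one above, but is written as a verbatim string and without the backslashes in front of < and >, which the .NET engine does not require.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripDemo
{
    // Alternatives are tried from left to right, so whole <script> and <style>
    // blocks are removed before the generic "any tag" alternative gets a chance.
    private static readonly Regex ScriptCssHtml = new Regex(
        @"(<script(.+?)</script>)|(<style(.+?)</style>)|(<.*?>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Removes script blocks, style blocks, and all remaining tags.
    public static string Strip(string input)
    {
        return ScriptCssHtml.Replace(input, string.Empty);
    }

    public static void Main()
    {
        string html =
            "<style>p { color: red; }</style>" +
            "<p>Hello <b>World</b></p>" +
            "<SCRIPT>alert('hi');</SCRIPT>";

        Console.WriteLine(Strip(html)); // Hello World
    }
}
```

The order of the alternatives matters: if the generic tag alternative came first, only the tags themselves would be removed and the script and style content would stay in the output.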
Here is a sample that returns the set of links in a given HTML string as an ISet<string>:
/// <summary>
/// Returns the set of distinct matches found in the given HTML string
/// </summary>
/// <param name="content">An HTML string</param>
/// <param name="linkType">Can be "link" or "imageLink"</param>
/// <returns>ISet&lt;string&gt; of the matched text</returns>
public ISet<string> GetNewLinks(string content, string linkType)
{
    Regex regexLink = GetRegex(linkType);
    ISet<string> newLinks = new HashSet<string>();
    // Typing the loop variable as Match gives direct access to match.Value
    foreach (Match match in regexLink.Matches(content))
    {
        // For "imageLink" the match is the whole <img> tag; the src value is in match.Groups[1].
        // HashSet.Add ignores duplicates, so no Contains check is needed.
        newLinks.Add(match.Value);
    }
    return newLinks;
}
If we need more patterns, we can keep a collection of Regex objects in one place and pick the one we need:
/// <summary>
/// Returns a Regex for the given element.
/// As this is a sample it has only two; we can add as many as we need
/// </summary>
/// <param name="htmlElement">Can be "link" or "imageLink" for this case</param>
/// <returns>Regex</returns>
private Regex GetRegex(string htmlElement = "link")
{
    switch (htmlElement)
    {
        case "link":
            // Matches the quoted href value when href directly follows <a
            return new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");
        case "imageLink":
            // Matches a whole <img> tag; group 1 captures the src value
            return new Regex("<img.+?src=[\"'](.+?)[\"'].*?>");
        default:
            // Fail with a clear message instead of returning null to the caller
            throw new ArgumentException("Unknown HTML element: " + htmlElement, nameof(htmlElement));
    }
}
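The two patterns behave differently: the "link" pattern uses lookarounds, so the match itself is the href value, while the "imageLink" pattern matches the whole tag and captures the src value in group 1. A minimal sketch of both, using a hypothetical LinkDemo class and sample HTML of my own:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkDemo
{
    // Same patterns as in GetRegex, kept in static fields so they are built only once.
    private static readonly Regex AnchorHref =
        new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");
    private static readonly Regex ImageSrc =
        new Regex("<img.+?src=[\"'](.+?)[\"'].*?>");

    // The lookbehind and lookahead are not part of the match,
    // so match.Value is the URL itself.
    public static ISet<string> GetAnchorLinks(string html)
    {
        var links = new HashSet<string>();
        foreach (Match match in AnchorHref.Matches(html))
        {
            links.Add(match.Value);
        }
        return links;
    }

    // The match is the whole <img> tag, so the URL is read from capture group 1.
    public static ISet<string> GetImageSources(string html)
    {
        var sources = new HashSet<string>();
        foreach (Match match in ImageSrc.Matches(html))
        {
            sources.Add(match.Groups[1].Value);
        }
        return sources;
    }

    public static void Main()
    {
        string html =
            "<a href=\"https://example.com/a\">A</a>" +
            "<a href='https://example.com/a'>A again</a>" +
            "<img src=\"/logo.png\" alt=\"Logo\">";

        Console.WriteLine(string.Join(", ", GetAnchorLinks(html)));  // https://example.com/a
        Console.WriteLine(string.Join(", ", GetImageSources(html))); // /logo.png
    }
}
```

Note that the anchor pattern only finds links where href is the first attribute after <a, so an anchor such as <a class="x" href="..."> is not matched by it.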
The full class code looks like this:
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public class RegHelper
{
    /// <summary>
    /// Gets the HTML of the given URL asynchronously, if the web page allows it
    /// </summary>
    /// <param name="webUrl">A valid web URL</param>
    /// <returns>The raw HTML and the set of links found in it</returns>
    public async Task<Tuple<string, ISet<string>>> GetCrawlerDataAsync(string webUrl = "https://www.ttmind.com/")
    {
        WebRequest newWebRequest = WebRequest.Create(webUrl);
        // Dispose the response, the stream, and the reader once we are done with them
        using (WebResponse newWebResponse = await newWebRequest.GetResponseAsync())
        using (Stream streamResponse = newWebResponse.GetResponseStream())
        using (StreamReader sreader = new StreamReader(streamResponse))
        {
            // Read the whole response body without blocking the thread
            string newString = await sreader.ReadToEndAsync();
            // Get the links only
            ISet<string> links = GetNewLinks(newString, "link");
            // Use a Tuple to return multiple values
            return new Tuple<string, ISet<string>>(newString, links);
        }
    }
    /// <summary>
    /// Gets the HTML of the given URL, if the web page allows it
    /// </summary>
    /// <param name="webUrl">A valid web URL</param>
    /// <returns>The raw HTML and the set of links found in it</returns>
    public Tuple<string, ISet<string>> GetCrawlerData(string webUrl = "https://www.ttmind.com/")
    {
        WebRequest newWebRequest = WebRequest.Create(webUrl);
        // Dispose the response, the stream, and the reader once we are done with them
        using (WebResponse newWebResponse = newWebRequest.GetResponse())
        using (Stream streamResponse = newWebResponse.GetResponseStream())
        using (StreamReader sreader = new StreamReader(streamResponse))
        {
            // Read the whole response body
            string newString = sreader.ReadToEnd();
            // Get the links only
            ISet<string> links = GetNewLinks(newString, "link");
            // Use a Tuple to return multiple values
            return new Tuple<string, ISet<string>>(newString, links);
        }
    }
    /// <summary>
    /// Uses two steps: one for script and CSS, and one for HTML
    /// </summary>
    /// <param name="input">string with HTML tags</param>
    /// <returns>string of plain text</returns>
    public string StripAllHTMLCssJavaScript(string input)
    {
        // Step 1: remove <script> and <style> blocks together with their content
        Regex reScriptCss = new Regex(
            "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        // Step 2: remove every remaining tag
        Regex reStripHTML = new Regex(
            "<.*?>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
        input = reScriptCss.Replace(input, string.Empty);
        input = reStripHTML.Replace(input, string.Empty);
        return input;
    }

    /// <summary>
    /// This method uses multiple conditions to search and replace "Script, Style and HTML tags"
    /// </summary>
    /// <param name="input">string with HTML tags</param>
    /// <returns>string of plain text</returns>
    public string StripAllHTMLCssJavaScriptLittleBetter(string input)
    {
        Regex reScriptCssHTML = new Regex(
            "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>|(<.*?>))",
            RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);
        input = reScriptCssHTML.Replace(input, string.Empty);
        return input;
    }
    /// <summary>
    /// Returns the set of distinct matches found in the given HTML string
    /// </summary>
    /// <param name="content">An HTML string</param>
    /// <param name="linkType">Can be "link" or "imageLink"</param>
    /// <returns>ISet&lt;string&gt; of the matched text</returns>
    public ISet<string> GetNewLinks(string content, string linkType)
    {
        Regex regexLink = GetRegex(linkType);
        ISet<string> newLinks = new HashSet<string>();
        // Typing the loop variable as Match gives direct access to match.Value
        foreach (Match match in regexLink.Matches(content))
        {
            // For "imageLink" the match is the whole <img> tag; the src value is in match.Groups[1].
            // HashSet.Add ignores duplicates, so no Contains check is needed.
            newLinks.Add(match.Value);
        }
        return newLinks;
    }
    /// <summary>
    /// Returns a Regex for the given element.
    /// As this is a sample it has only two; we can add as many as we need
    /// </summary>
    /// <param name="htmlElement">Can be "link" or "imageLink" for this case</param>
    /// <returns>Regex</returns>
    private Regex GetRegex(string htmlElement = "link")
    {
        switch (htmlElement)
        {
            case "link":
                // Matches the quoted href value when href directly follows <a
                return new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");
            case "imageLink":
                // Matches a whole <img> tag; group 1 captures the src value
                return new Regex("<img.+?src=[\"'](.+?)[\"'].*?>");
            default:
                // Fail with a clear message instead of returning null to the caller
                throw new ArgumentException("Unknown HTML element: " + htmlElement, nameof(htmlElement));
        }
    }
}
To call the above code in our Console Application,
class Program
{
    static void Main(string[] args)
    {
        // Wait for the async work to finish so that any exception surfaces here
        PrintResultsAsync().GetAwaiter().GetResult();
        Console.ReadLine();
    }

    // Returns Task instead of void so that the caller can wait for it and observe failures
    private static async Task PrintResultsAsync()
    {
        RegHelper cH = new RegHelper();
        Tuple<string, ISet<string>> objData = await cH.GetCrawlerDataAsync("https://github.com/");
        Console.WriteLine("Original HTML String \n");
        Console.WriteLine(objData.Item1);
        Console.WriteLine("Links \n");
        foreach (string lnk in objData.Item2)
        {
            Console.WriteLine(lnk);
        }
    }
}
I hope these examples help a little if you are doing string manipulation or pattern searching. Please share your ideas in the comments as well, so that others can get more help when using regular expressions.