How I Built Decodify, Part 2: Having Fun With Regex & Loops

Hi there! I am Tony Stark *coughs* Somdev Sangwan and this whole paragraph means absolutely nothing. I suggest you start reading from the next paragraph because I wrote this to waste your time *evil laugh* I am such a bad guy. Are you still reading this? Are you wondering what kind of website this is? Do you think I am stupid? LOL you are the moron here who is wasting his time reading this useless shit instead of skipping to the next paragraph and learning something new. Marvel is better than DC. You know what? Let's get started.

Hi there! I hope you have read the previous article about Decodify, where we wrote the code to decode the Caesar cipher. In this chapter, we will be making Decodify what it was meant to be. Let's go!

Aim #1 Detect & Decode

So I wanted Decodify to detect and decode encodings such as Base64, URL, Hex, Binary etc. So I started with an easy one, URL encoding.

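from re import search
from urllib import unquote  # Python 2; in Python 3 this lives in urllib.parse

def urle(string):
    # Decode the URL encoded string, print the result and exit
    string = unquote(string)
    print ' Decoded from URL encoding : %s' % (string)
    quit()

# string holds the text entered by the user
# Look for two %XX sequences in a row
url = search(r'%..%..', string)
if url:
    urle(string)
else:
    print 'This string isn\'t URL encoded'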

So this is the code. Let me break it down line by line:

  • We are defining a function named urle which accepts an argument named string.
  • unquote() is a function from Python's urllib module (urllib.parse in Python 3) which can decode URL encoding, so unquote(string) just decodes the URL encoded string.
  • Then we are just printing the decoded result.
  • quit() terminates/exits/quits the program.
  • url = search(r'%..%..', string) Well, this is some regex stuff. search (from the re module) is used to search for a pattern, the r prefix marks a raw string so that backslashes in a pattern are passed to the regex engine untouched, and string is the string we want to search in. But what about the %..%..? Well, this is the regex pattern. As we know, a URL encoded string looks like this: "%6c%75%6e%64". So the scheme is 2 characters after a % sign. Now let's come back to our regex pattern. A dot matches any single character, so %..%.. will match any string which has a % sign followed by any two characters followed by a % sign followed by any two characters. Got it? Cool. If a match is found, search returns a match object (which counts as true), otherwise it returns None (which counts as false).
  • if url: checks if the value of url is true. If it is true, it calls the urle() function with the argument string, and we all know what urle() does.
  • else: if the value of url is not true, we print "This string isn't URL encoded".
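
The steps above can be condensed into a small sketch. It is written for Python 3 and returns the result instead of printing and quitting, which makes it easier to reuse. The function name decode_url and the stricter pattern (two % signs each followed by two hex digits) are assumptions made for illustration, not Decodify's actual code.

```python
import re
from urllib.parse import unquote  # Python 3 location of unquote

# Assumed pattern: two consecutive %XX sequences, where XX are hex digits
URL_PATTERN = re.compile(r'(?:%[0-9A-Fa-f]{2}){2}')

def decode_url(string):
    """Return the URL-decoded string, or None if it doesn't look URL encoded."""
    if URL_PATTERN.search(string):
        return unquote(string)
    return None

print(decode_url('%68%65%6c%6c%6f'))  # hello
print(decode_url('plain text'))       # None
```

Restricting the two characters to hex digits avoids treating something like "50%off%now" as URL encoded, which the looser %..%.. pattern would match.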

Alright, so now Decodify can detect and decode URL encoding, but there are many more encodings that Decodify should be able to decode. I won't explain them all in as much detail as this one because the aim of this tutorial is to introduce you to new things and show you how to get shit done.

So the next target is binary. A regex pattern to detect binary would be ^[01]+$. Here, [01] means that we want to match only these two characters, and + means one or more of them in a row. If we don't use +, the pattern will only match a string that is exactly one character long, a single 0 or 1. ^ means the start of the string and $ means the end of the string. So this regex pattern will match any string which has only 0 and 1 in it, from the start to the end.
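
Here is a minimal Python 3 sketch of detecting and decoding binary with that pattern. The function name decode_binary, the assumption that the bits are 8-bit character codes, and the extra length check are mine for illustration; the article does not show how Decodify actually decodes binary.

```python
import re

def decode_binary(string):
    """Decode a string of 0s and 1s read as 8-bit character codes.

    Returns None if the string is not binary or its length is not a
    multiple of 8 (an assumed extra check, not from the article).
    """
    if not re.search(r'^[01]+$', string) or len(string) % 8 != 0:
        return None
    # Cut the string into groups of 8 bits and convert each group to a character
    chunks = [string[i:i + 8] for i in range(0, len(string), 8)]
    return ''.join(chr(int(chunk, 2)) for chunk in chunks)

print(decode_binary('0110100001101001'))  # hi
print(decode_binary('10102'))             # None
```
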
The next one is Base64. This is a Base64 encoded string: Z2FhbmQgbWUga2h1amxpIGg/
The code for detecting it would be:

b64 = search(r'^[A-Za-z0-9+\/=]+$', string)
if len(string)%4 == 0 and b64:
    print 'This is Base64 Encoding'

So the regex pattern here is ^[A-Za-z0-9+\/=]+$. Again, ^ is the start of the string, + means one or more of the allowed characters in a row and $ is the end of the string. A-Za-z0-9+\/= are the characters we want to look for. A-Z represents all uppercase letters, a-z represents all lowercase letters, 0-9 is for the digits and +, / and = are the additional characters we want to look for (the backslash before / is just an escape and isn't strictly needed in Python). Easy, right? But what's up with the second line of the code?
Well, len(string) will return the length of string and % is used to find the remainder. So we are dividing the length of the string by 4 and then checking if the remainder is 0. So why are we doing maths here? Well, there's a thing about Base64: the length of the encoded string is always a multiple of 4, because the output is padded with = when needed. So if you encode a string with Base64, its length will be 4 or 8 or 12 or 16 or 20 and so on…

We are using two conditions in the if statement: one is that length thing and the other is our regex thing, i.e. b64. I am using and between them because I want to make sure the string's length is a multiple of 4 and it satisfies our regex pattern too.
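
Both checks can be combined with the actual decoding step, as in this Python 3 sketch. The function name decode_base64 and the choice to treat the decoded bytes as UTF-8 text are assumptions for illustration.

```python
import base64
import binascii
import re

def decode_base64(string):
    """Return the Base64-decoded text, or None if the checks fail."""
    # Same two conditions as above: length is a multiple of 4, only Base64 characters
    if len(string) % 4 != 0 or not re.search(r'^[A-Za-z0-9+/=]+$', string):
        return None
    try:
        # Assumption: the decoded bytes are UTF-8 text
        return base64.b64decode(string).decode('utf-8')
    except (binascii.Error, UnicodeDecodeError):
        return None

print(decode_base64('aGVsbG8='))  # hello
print(decode_base64('hello'))     # None, the length is not a multiple of 4
```

Keep in mind that these checks can produce false positives: an ordinary word whose length happens to be a multiple of 4 also passes both conditions, which is why the sketch falls back to None when the decoded bytes are not valid text.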

Next is StringFromChar encoding. Here's an encoded text: String.fromCharCode(102, 117, 99, 107, 121, 101, 97, 104)

For the other encodings, there were built-in Python functions or libraries to decode them, but for this encoding I had to write my own method to decode it. Before we look at that method, let's take a look at how we can detect it with regex:

jv_char = search(r'\d+, \d+,', string)
if jv_char:
    print 'This is StringFromChar Encoding'

The regex pattern is \d+, \d+,. \d means a digit and + again means one or more in a row. So this pattern looks for two numbers in a row, each followed by a comma, with a space between them.
And the code I wrote to decode StringFromChar Encoding is:

def fromchar(string):
    string = string.strip('String.fromCharCode(').strip(')').strip(' ')
    jv_list = string.split(',')
    decoded = []
    for i in jv_list:
        i = i.replace(' ', '').replace('97', 'a').replace('98', 'b').replace('99', 'c').replace('100', 'd').replace('101', 'e').replace('102', 'f').replace('103', 'g').replace('104', 'h').replace('105', 'i').replace('106', 'j').replace('107', 'k').replace('108', 'l').replace('109', 'm').replace('110', 'n').replace('111', 'o').replace('112', 'p').replace('113', 'q').replace('114', 'r').replace('115', 's').replace('116', 't').replace('117', 'u').replace('118', 'v').replace('119', 'w').replace('120', 'x').replace('121', 'y').replace('122', 'z')
        decoded.append(i)
    string = ''.join(decoded)
    print ' Decoded from FromChar : %s'%(string)

Lemme break down the code:

  • Firstly, we defined a function fromchar which accepts string as an argument.
  • Then we are using strip(), which removes characters from both ends of a string. So if you strip 0 from 101000, it will become 101; the zero in the middle stays. Note that strip() treats its argument as a set of characters rather than a word, so string.strip('haha') removes every h and a from the start and the end of string. That is why it works here: none of the characters in String.fromCharCode( is a digit, so stripping stops at the first number. So yeah, we are using strip to remove String.fromCharCode(, the closing ) and spaces. Gotcha?
  • string.split(',') transforms 12,9,175,01,3 into ['12', '9', '175', '01', '3']. Yep! Yep! You give split() one or more characters, it uses them as a divider and makes a list where the divided pieces become the elements of the list. For example, 'hahaxwowxlolxy'.split('x') will result in ['haha', 'wow', 'lol', 'y']. This list is assigned to the variable jv_list.
  • decoded is a blank list which we will be using to store decoded elements.
  • In the next line, we start a for loop which iterates over the list jv_list.
  • Let's say there's a string named speech and you want to replace all occurrences of haha with lol. Then you can simply do speech.replace('haha', 'lol'). We are doing the same shit in here. We are replacing the encoded values with their respective decoded values.
  • decoded.append(i) adds the value of variable i to the list decoded.
  • ''.join(decoded) converts the elements of the list decoded to a string by joining them together. We have done this before.
  • In the last line we finally print the decoded string.
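
Since strip() is easy to misread, here is a short Python 3 demonstration of how it behaves, followed by a hypothetical helper (strip_wrapper is not part of Decodify) that removes an exact prefix and suffix instead of a set of characters.

```python
# strip() removes characters from both ends only, and treats its
# argument as a set of characters rather than a whole word
print('101000'.strip('0'))   # 101
print('banana'.strip('ab'))  # nan

def strip_wrapper(encoded, prefix='String.fromCharCode(', suffix=')'):
    """Remove an exact prefix and suffix; return the input unchanged if absent."""
    if encoded.startswith(prefix) and encoded.endswith(suffix):
        return encoded[len(prefix):len(encoded) - len(suffix)]
    return encoded

inner = strip_wrapper('String.fromCharCode(104, 105)')
print(inner)             # 104, 105
print(inner.split(','))  # ['104', ' 105']
```

The split() output still contains a leading space in the second element, which is why the decoding loop removes spaces from each item before converting it.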

Note: This code is just a demo, so I am only replacing the codes for lowercase letters here, but the real Decodify replaces a-z, A-Z, 0-9 and all the special characters printed on your keyboard.
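
As an alternative to a long chain of replace() calls, Python's built-in chr() converts a character code straight to its character, which covers every character at once. The sketch below (Python 3) swaps in that technique; the function name decode_fromchar is hypothetical and this is not how the article's fromchar is written.

```python
import re

def decode_fromchar(string):
    """Decode 'String.fromCharCode(104, 105, ...)' style text.

    Returns None if the detection pattern from the article doesn't match.
    """
    # Detection: two numbers in a row, each followed by a comma
    if not re.search(r'\d+, \d+,', string):
        return None
    # Pull out every run of digits and turn each code into its character
    codes = re.findall(r'\d+', string)
    return ''.join(chr(int(code)) for code in codes)

print(decode_fromchar('String.fromCharCode(104, 101, 108, 108, 111)'))  # hello
print(decode_fromchar('no numbers here'))                               # None
```

Converting each code with chr(int(code)) also avoids a subtle risk of text replacement, where a code such as 197 contains 97 and would be partly rewritten.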

So far our code looks like this:

from re import search

string = raw_input('Enter the string you want to decode: ')

def decode(string):
    # One helper per encoding
    def binary(string):
        pass  # decoding happens here
    def fromchar(string):
        pass  # decoding happens here
    def urle(string):
        pass  # decoding happens here
    def base64(string):
        pass  # decoding happens here

    # Run all the detection patterns first...
    bina = search(r'^[01]+$', string)
    jv_char = search(r'\d+, \d+,', string)
    url = search(r'%..%..', string)
    b64 = search(r'^[A-Za-z0-9+\/=]+$', string)

    # ...then call the helper for the first encoding that matches
    if bina:
        binary(string)
    elif jv_char:
        fromchar(string)
    elif url:
        urle(string)
    elif len(string)%4 == 0 and b64:
        base64(string)
    else:
        print 'This encoding is not supported'

decode(string)

It works fine, but I wanted Decodify to be able to decode recursively, i.e. to decode a string which has been encoded multiple times. So I just tweaked the code a little bit.
And now it looks like this:

from re import search

string = raw_input('Enter the string you want to decode: ')

def decode(string):
    # In every helper, decoded_string stands for the result of the decoding step.
    # The result is passed back to decode() to check for another layer of encoding.
    def binary(string):
        # decoding happens here
        string = decoded_string
        decode(string)
        quit()
    def fromchar(string):
        # decoding happens here
        string = decoded_string
        decode(string)
        quit()
    def urle(string):
        # decoding happens here
        string = decoded_string
        decode(string)
        quit()
    def base64(string):
        # decoding happens here
        string = decoded_string
        decode(string)
        quit()

    bina = search(r'^[01]+$', string)
    jv_char = search(r'\d+, \d+,', string)
    url = search(r'%..%..', string)
    b64 = search(r'^[A-Za-z0-9+\/=]+$', string)

    if bina:
        binary(string)
    elif jv_char:
        fromchar(string)
    elif url:
        urle(string)
    elif len(string)%4 == 0 and b64:
        base64(string)
    else:
        print 'This encoding is not supported'

decode(string)

So what changed here? The user enters a string, the string is checked for its encoding type and, once a match is found, it gets decoded. Then the decoded text is checked again to see if it's encoded, and decoded again if a match is found. It keeps going as long as the decoded text is found to be encoded.
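
To make the idea concrete, here is a self-contained Python 3 sketch of the same detect, decode and repeat loop. It is not Decodify's real code: the names decode_once and decode, the depth limit, the stricter URL pattern and the choice to return the result instead of printing and quitting are all assumptions made for illustration.

```python
import base64
import binascii
import re
from urllib.parse import unquote

def decode_once(string):
    """Return (encoding name, decoded text) for one layer, or None if nothing matches."""
    if re.search(r'^[01]+$', string) and len(string) % 8 == 0:
        chunks = [string[i:i + 8] for i in range(0, len(string), 8)]
        return 'binary', ''.join(chr(int(chunk, 2)) for chunk in chunks)
    if re.search(r'\d+, \d+,', string):
        codes = re.findall(r'\d+', string)
        return 'fromCharCode', ''.join(chr(int(code)) for code in codes)
    if re.search(r'(?:%[0-9A-Fa-f]{2}){2}', string):
        return 'URL', unquote(string)
    if len(string) % 4 == 0 and re.search(r'^[A-Za-z0-9+/=]+$', string):
        try:
            return 'Base64', base64.b64decode(string).decode('utf-8')
        except (binascii.Error, UnicodeDecodeError):
            return None
    return None

def decode(string, depth=0, max_depth=10):
    """Recursively peel off layers; return (final text, list of layers found)."""
    # The depth limit is an assumed safety net against endless recursion
    result = decode_once(string) if depth < max_depth else None
    if result is None:
        return string, []
    name, decoded = result
    final, layers = decode(decoded, depth + 1, max_depth)
    return final, [name] + layers

# Build a two-layer sample: Base64 first, then every character URL encoded
inner = base64.b64encode(b'hi there').decode('ascii')
layered = ''.join('%{:02x}'.format(ord(c)) for c in inner)
print(decode(layered))  # ('hi there', ['URL', 'Base64'])
```

Returning the list of layers makes it easy to see which encodings were peeled off and in what order. The recursion stops as soon as no pattern matches, which corresponds to the else branch in the code above.
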
Phewww! That's all for now. See ya in the next chapter of Ultimate Python.
Keep Learning! Keep Coding!

