How to extract a substring from inside a string in Python?
Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.
I only know the few characters directly before (AAA) and directly after (ZZZ) the part I am interested in (1234).
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
How to do the same thing in Python?
12 Answers
Using regular expressions - documentation for further reference
import re
text = 'gfgfdAAA1234ZZZuijjk'
m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)
# found: 1234
or:
import re
text = 'gfgfdAAA1234ZZZuijjk'
try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = ''  # apply your error handling
# found: 1234
Doesn't the indexing start at 0? So you would need to use group(0) instead of group(1)?
– Alexander
Nov 8 '15 at 22:16
@Alexander, no, group(0) will return full matched string: AAA1234ZZZ, and group(1) will return only characters matched by first group: 1234
– Yurii K
Nov 12 '15 at 13:46
@Bengt: Why is that? The first solution looks quite simple to me, and it has fewer lines of code.
– HelloGoodbye
Jul 7 '16 at 13:21
In this expression the ? makes the + non-greedy, i.e. it matches one or more characters but as few as possible, expanding only as necessary. Without the ?, the first group would match 2ZZZkeAAA43 in gfgfAAA2ZZZkeAAA43ZZZonife; with the ? it matches only the 2, and searching for further matches (or stripping the first match out and searching again) would then find the 43.
– Dom
Jul 19 '17 at 8:31
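The points raised in these comments (group(0) versus group(1), greedy versus non-greedy) can be checked with a small sketch. It reuses the sample string from the comment above; the variable names are only illustrative.

```python
import re

text = 'gfgfAAA2ZZZkeAAA43ZZZonife'

# group(0) is the whole match, group(1) is the first capture group.
m = re.search('AAA(.+?)ZZZ', text)
whole = m.group(0)   # 'AAA2ZZZ'
inner = m.group(1)   # '2'

# Greedy: (.+) runs on to the last ZZZ in the string.
greedy = re.search('AAA(.+)ZZZ', text).group(1)   # '2ZZZkeAAA43'

# Non-greedy with findall returns every delimited piece.
pieces = re.findall('AAA(.+?)ZZZ', text)   # ['2', '43']
```

If the string contains only one AAA...ZZZ pair, as in the question, both forms give the same result.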
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'
Then you can use regexps with the re module as well, if you want, but that's not necessary in your case.
The question seems to imply that the input text will always contain both "AAA" and "ZZZ". If this is not the case, your answer fails horribly (by that I mean it returns something completely wrong instead of an empty string or throwing an exception; think "hello there" as input string).
– tzot
Feb 6 '11 at 23:46
@user225312 Is the re method not faster though?
– confused00
Jul 21 '16 at 9:25
Voteup, but I would use "x = 'AAA' ; s.find(x) + len(x)" instead of "s.find('AAA') + 3" for maintainability.
– Alex
Jun 21 '17 at 8:47
If any of the tokens can't be found in s, s.find will return -1. The slicing operator s[begin:end] will accept it as a valid index and return an undesired substring.
– ribamar
Aug 28 '17 at 15:44
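One way to address the concerns in these comments is to check the result of find before slicing. This is a minimal sketch, not part of the original answer: the function name between and the choice to return None on a miss are assumptions made for illustration.

```python
def between(s, left, right):
    """Return the text between left and right, or None if either is missing."""
    start = s.find(left)
    if start == -1:          # left delimiter not present
        return None
    start += len(left)       # skip past the delimiter itself
    end = s.find(right, start)
    if end == -1:            # right delimiter not present after left
        return None
    return s[start:end]


result = between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ')   # '1234'
missing = between('hello there', 'AAA', 'ZZZ')           # None
```

Using len(left) instead of a hard-coded 3 also covers the maintainability point made above.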
import re
re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)
The above as-is will fail with an AttributeError if there are no "AAA" and "ZZZ" in your_text.
your_text.partition("AAA")[2].partition("ZZZ")[0]
The above will return an empty string if "AAA" doesn't exist in your_text. If only "ZZZ" is missing, it returns everything after "AAA".
PS Python Challenge?
This answer probably deserves more up votes. The string method is the most robust way. It does not need a try/except.
– ChaimG
Dec 3 '15 at 2:59
... nice, though limited. partition is not regex based, so it only works in this instance because the search string was bounded by fixed literals
– GreenAsJade
Feb 29 '16 at 2:07
Great, many thanks! - this works for strings and does not require regex
– Alex
Jun 8 at 11:53
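str.partition returns a 3-tuple whose middle element is the separator it found, or an empty string when the separator is absent. That makes it possible to tell a missing delimiter apart from an empty result. The helper below is a sketch under that observation; its name and the None return value are assumptions, not part of the answer above.

```python
def between_partition(text, left, right):
    """Text between left and right, or None if either delimiter is absent."""
    _, sep1, rest = text.partition(left)
    inner, sep2, _ = rest.partition(right)
    if not sep1 or not sep2:   # an empty separator means "not found"
        return None
    return inner


found = between_partition('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ')   # '1234'
no_right = between_partition('gfgfdAAA1234', 'AAA', 'ZZZ')        # None
```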
import re
print(re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))
AttributeError: 'NoneType' object has no attribute 'groups' - if there is no AAA, ZZZ in the string...
– eumiro
Jan 12 '11 at 9:20
You can use the re module for that:
>>> import re
>>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
('1234',)
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
You could do the same with the re.sub function, using the same regex.
>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'
In basic sed, capturing groups are written as \(..\), but in Python they are written as (..).
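One caveat with the substitution approach: when the pattern does not match, re.sub returns the input unchanged rather than an empty string or an error. The sketch below shows that behaviour and one possible way to make the no-match case explicit with re.fullmatch; the variable names are illustrative only.

```python
import re

pattern = r'.*AAA(.*)ZZZ.*'

hit = re.sub(pattern, r'\1', 'gfgfdAAA1234ZZZuijjk')   # '1234'
miss = re.sub(pattern, r'\1', 'hello there')           # 'hello there' (unchanged)

# re.fullmatch returns None when the pattern does not match,
# so the caller can decide what a miss should mean.
m = re.fullmatch(pattern, 'hello there')
result = m.group(1) if m else None
```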
You can find the first occurrence of a substring (by character index) with this function. You can also get what comes after the substring.
def FindSubString(strText, strSubString, Offset=None):
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1  # Not found
        else:
            if Offset is None:
                # Everything after the substring
                Result = strText[Start+len(strSubString):]
            elif Offset == 0:
                # Index where the substring starts
                return Start
            else:
                # The next Offset characters after the substring
                AfterSubString = Start+len(strSubString)
                Result = strText[AfterSubString:AfterSubString + int(Offset)]
            return Result
    except Exception:
        return -1

# Example:
Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"
print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")
print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")
print('What is after substring "%s"?' % (subText))
print(FindSubString(Text, subText))

# Your answer:
Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"
AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforText2 = FindSubString(Text, subText2, 0)
print("\nYour answer:\n%s" % (Text[AfterText1:BeforText2]))
Just in case somebody has to do the same thing that I did: I had to extract everything inside parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:
import re
regex = r'.*\((.*?)\).*'
matches = re.search(regex, line)
line = matches.group(1) + '\n'
I.e. you need to escape the parentheses with a backslash (\). Though this is more a problem about regular expressions than about Python.
Also, in some cases you may see an 'r' prefix before the regex definition. It marks a raw string; without the r prefix, you need to escape backslashes as in C. Here is more discussion on that.
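To make the raw-string point concrete, here is a minimal sketch using the example line from this answer. It only illustrates how Python string literals treat backslashes; the variable names are made up for the example.

```python
import re

line = 'US president (Barack Obama) met with ...'

# Raw string: backslashes reach the regex engine untouched.
raw = re.search(r'\((.*?)\)', line).group(1)

# Regular string: each backslash has to be doubled to survive
# Python's own string-literal processing.
escaped = re.search('\\((.*?)\\)', line).group(1)

# Where it really matters: '\b' in a normal string is a single
# backspace character, while r'\b' is the two characters \ and b
# (a word boundary in a regex).
normal_len = len('\b')   # 1
raw_len = len(r'\b')     # 2
```

Both searches extract the same text; the raw form is simply easier to read and harder to get wrong.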
>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
You can do it using just one line of code:
>>> import re
>>> re.findall(r'\d{1,5}', 'gfgfdAAA1234ZZZuijjk')
['1234']
The result is a list.
In Python, extracting a substring from a string can be done using the findall method in the regular expression (re) module.
>>> import re
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> ss = re.findall('AAA(.+)ZZZ', s)
>>> print(ss)
['1234']
One-liners that return another string if there was no match.
Edit: the improved version uses the next function; replace "not-found" with something else if needed:
import re
res = next( (m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk" ),] if m), "not-found" )
My other method, which is less optimal, runs a regex a second time; I still haven't found a shorter way:
import re
res = ( ( re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()","") ).group(1) )
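If a one-liner is not a hard requirement, the same "return a default on no match" idea may read more clearly as a small function. This is a sketch, not the answer's own code: the name extract_between and its parameters are hypothetical, and re.escape is used so that delimiters containing regex metacharacters are treated literally.

```python
import re


def extract_between(text, left, right, default="not-found"):
    """Return the shortest text between left and right, or default on no match."""
    # re.escape lets delimiters such as "(" or "." be used as plain text.
    pattern = re.escape(left) + "(.*?)" + re.escape(right)
    m = re.search(pattern, text)
    return m.group(1) if m else default


res = extract_between("gfgfdAAA1234ZZZuijjk", "AAA", "ZZZ")   # '1234'
fallback = extract_between("hello there", "AAA", "ZZZ")       # 'not-found'
```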
The second solution is better if the pattern matches most of the time, because it's easier to ask for forgiveness than permission.
– Bengt
Jan 14 '13 at 16:11