query-replace-regexp or “making movies enjoyable”

One of the movies I recently watched came with subtitles in a separate *.srt file. The subtitles had 99% correct timing information, a pretty amazing feat. They had a minor glitch though. Many instances of the lowercase letter ‘l’ (el) had been replaced with an uppercase ‘I’.

I noticed this tiny glitch in the first few lines of text, and then discovered that it was a buglet that occurred far too often. My OCD side started feeling bad about the movie, because all the bogus capital ‘I’ letters were distracting me from the “real” fun of watching the actual movie.

So I stopped watching, and I fired up GNU Emacs on the srt file.

After 1-2 minutes of work, and a lot of fun with query-replace-regexp, I found a nice replacement pattern to interactively fix all the broken ‘I’ instances:

M-x query-replace-regexp RET
    \([^I ]*\)\(I+\)\([^I ]*\) RET
    \1\,(replace-regexp-in-string "i" "l" (downcase \2))\3 RET

This had quite a few false positives, but it did 95% of the work, so I manually fixed the 5-10 instances it didn’t catch, and I was finally able to enjoy the movie.

Note: The perceptive reader who is also a fan of regular expressions will probably notice very quickly that part of the third line is Lisp code. This is an amazing feature of regexp replacement in Emacs. When the special pattern \, appears in the replacement text, Emacs evaluates the expression that follows it as Emacs Lisp. An arbitrarily complex Lisp expression can be used after \, and its return value is used as the replacement text.
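
For readers who don't live in Emacs, here is a rough Python analogue of the same idea: the replacement is computed by a function instead of being a fixed string. This is only a sketch, not what I ran on the subtitle file. The names fix_match and fix_line are made up for illustration, and unlike query-replace-regexp it replaces every match without asking.

```python
import re

# The same pattern as in the Emacs command, written in Python syntax
# (grouping parentheses are not backslash-escaped here).
PATTERN = re.compile(r"([^I ]*)(I+)([^I ]*)")


def fix_match(match):
    # Plays the role of the Lisp expression in the replacement text:
    # downcase the run of I's, then turn every "i" into "l".
    before, run_of_i, after = match.groups()
    return before + run_of_i.lower().replace("i", "l") + after


def fix_line(line):
    # Replaces every match unconditionally (no y/n prompt).
    return PATTERN.sub(fix_match, line)


print(fix_line("HeIIo, wiII you caII me?"))  # Hello, will you call me?
```

A standalone "I" matches the pattern too, so fix_line("I wiII") gives "l will". That is the kind of false positive the interactive prompt lets you skip.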

I’m positively thrilled that Emacs saved the day… again :-)

7 thoughts on “query-replace-regexp or “making movies enjoyable””

  1. karthik

    I think sed would have saved the day faster. :P

    On the whole, I’m a fan of Emacs, but I find that the regular expression syntax is a bit odd compared to perl/sed/grep. I’m used to the latter set, and it’s jarring when I can’t write a simple Regex (like the above one) in Emacs. (It’s different in Vim from all of the above, too. Oh, woe!)

    But I’ve only been learning Elisp for a month now, so I didn’t know of the \, operator. That is simply amazing. Thanks for singling that out and explaining.

  2. keramida Post author

    karthik, you are right, sed/perl would probably do the same.

    There are two bits that are important here though. One of them is the dynamic Lisp bits in the replacement pattern (glad you liked it).

    The other one is the interactive prompt for each replacement. query-replace* functions prompt for a yes/no response, so you can quickly go through a large file and keep typing “y y y n y y y n n y y n y y y” to replace only some of the matches.

    I’m not sure if I can do the same with a relatively small Perl script :)

  3. aaron

    I haven’t studied this intently, nor do I understand the scenario you have, perhaps. But couldn’t you have done the following: C-M-% \BI\B\|\BI\|I\B RET l RET?

  4. keramida Post author

    aaron, that was my first attempt. Then I realized there are phrases like “I will”, “I thought I said ‘foo’” and similar. Many of these should really be using an uppercase ‘I’ instead of a lowercase ‘l’.

    1. aaron

      Right, the regular expression “\BI\B\|\BI\|I\B” should only match uppercase I’s that aren’t alone, avoiding any, as you say, “false positives”.
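
The per-match prompting discussed in the comments above can be sketched in Python as well. This is a minimal illustration under a few assumptions: query_replace and its decide callback are hypothetical names, the scripted answers stand in for typing y/n at the prompt, and Python's \B is assumed to be close enough to the Emacs one for plain ASCII subtitle text.

```python
import re

# aaron's pattern in Python syntax: an uppercase I that has a word
# character on at least one side, so a standalone "I" never matches.
PATTERN = re.compile(r"\BI\B|\BI|I\B")


def query_replace(text, pattern, replacement, decide):
    # Replace only the matches for which decide(match) returns True.
    pieces = []
    last_end = 0
    for match in pattern.finditer(text):
        pieces.append(text[last_end:match.start()])
        pieces.append(replacement if decide(match) else match.group(0))
        last_end = match.end()
    pieces.append(text[last_end:])
    return "".join(pieces)


# Scripted answers: "n" for the I in "It", then "y" for the other four.
answers = iter("nyyyy")
line = "It is I who wiII caII."
fixed = query_replace(line, PATTERN, "l", lambda m: next(answers) == "y")
print(fixed)  # It is I who will call.
```

Answering "y" to everything would turn "It" into "lt", because a word-initial I followed by a letter still matches the pattern. A real interactive version could have decide call input() for each match.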
