Hi, my goal is to run a script that removes duplicate strings of text from incremental-style subtitles and merges what remains into a plain-text transcript containing only the text of the subtitle file (without subtitle timings, formatting commands, or control characters).
All the text should be merged onto one line with no overlapping duplicates. I'd prefer to use command-line tools available on macOS if possible; I use MacPorts and have GNU Core Utilities installed. I can use GUI software, or Windows 10 in a virtual machine, for a one-off quick fix, but I'd like an automated bash script or similar that I can trigger on macOS.
An example to help you help me.
I want this SRT file with incremental subtitles (two rows of subtitles are shown at all times: the currently spoken subtitle and the previous subtitle for slow readers, similar to what you see on TV with real-time captions):
Code:
18
00:00:23,999 --> 00:00:24,009
we've already discussed research and
19
00:00:24,009 --> 00:00:26,460
we've already discussed research and
prevalence barriers and assessment shoes
20
00:00:26,460 --> 00:00:26,470
prevalence barriers and assessment shoes
21
00:00:26,470 --> 00:00:28,769
prevalence barriers and assessment shoes
and now we'll discuss where you can send
22
00:00:28,769 --> 00:00:28,779
and now we'll discuss where you can send
23
00:00:28,779 --> 00:00:31,649
and now we'll discuss where you can send
a client what kind of treatment you can
24
00:00:31,649 --> 00:00:31,659
a client what kind of treatment you can
25
00:00:31,659 --> 00:00:35,000
a client what kind of treatment you can
use and what approaches are available
26
00:00:35,000 --> 00:00:35,010
use and what approaches are available
27
00:00:35,010 --> 00:00:37,560
use and what approaches are available
within the general population
28
00:00:37,560 --> 00:00:37,570
within the general population
29
00:00:37,570 --> 00:00:38,970
within the general population
there are many different approaches to
30
00:00:38,970 --> 00:00:38,980
there are many different approaches to

To become this:
Code:
we've already discussed research and prevalence barriers and assessment shoes and now we'll discuss where you can send a client what kind of treatment you can use and what approaches are available within the general population there are many different approaches to
My scripting abilities are quite rudimentary, even if I dabble from time to time. My first thought is to regex out the subtitle line numbers and time codes; that should be easy enough even for me. But how to set up the array, how to compare/match complete lines to partial lines up to maybe 5 or 6 subtitles forward or back in the array, and then how to concatenate/merge whatever is left is well over my head. I would much appreciate guidance on figuring this out.
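For illustration, here is a minimal sketch of the approach outlined above: strip the cue numbers and time codes, skip any text line already seen among the last few kept lines, and join the rest. It assumes Python 3 is available on the Mac, which the post does not state. The names `srt_to_transcript`, `window`, and `SAMPLE` are hypothetical and only for this example. It also assumes duplicates are whole-line repeats, as in the sample, so it does not attempt partial-line matching.

```python
import re
from collections import deque

# Matches an SRT timing line such as "00:00:23,999 --> 00:00:24,009".
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s+-->\s+\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
# Matches simple markup such as <i>...</i> or <font color="...">.
TAGS = re.compile(r"</?[^>]+>")


def srt_to_transcript(srt_text, window=6):
    """Return the subtitle text on one line, skipping recently seen lines.

    `window` is how many previously kept lines are checked for duplicates
    (an assumption based on the "5 or 6 subtitles" mentioned in the post).
    """
    lines = [line.strip() for line in srt_text.splitlines()]
    recent = deque(maxlen=window)  # last few kept lines
    kept = []
    for i, line in enumerate(lines):
        # Skip blank lines and timing lines.
        if not line or TIMING.match(line):
            continue
        # A digits-only line counts as a cue number only when a timing
        # line follows it, so numeric subtitle text is not lost.
        if line.isdigit() and i + 1 < len(lines) and TIMING.match(lines[i + 1]):
            continue
        text = TAGS.sub("", line).strip()
        if not text or text in recent:
            continue
        recent.append(text)
        kept.append(text)
    return " ".join(kept)


SAMPLE = """18
00:00:23,999 --> 00:00:24,009
we've already discussed research and
19
00:00:24,009 --> 00:00:26,460
we've already discussed research and
prevalence barriers and assessment shoes
20
00:00:26,460 --> 00:00:26,470
prevalence barriers and assessment shoes
"""

if __name__ == "__main__":
    print(srt_to_transcript(SAMPLE))
```

One caveat with this design: a line that is genuinely spoken twice within the window would also be dropped, so a smaller `window` may be safer for some files. To process a real file, the text could be read with `open(path, encoding="utf-8-sig").read()` and passed to the function.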
I've attached the sample subtitle as seen in the screenshot for the convenience of anyone who wants to play around with this and help me out.
Thanks