This is another of the subprograms I wrote while working through a group of files containing HTML code. Pass it a buffer containing a unit of HTML code (I have been calling it a paragraph, because what I am working on will ultimately be viewed as electronic books) and it returns a table in which each entry is either an HTML tag item (bounded by < and >) or a text item (a word that starts after the previous tag or space and ends at the next space or tag).
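The splitting rule described above can be sketched in a few lines of bash. This is only an illustration of the rule, not the COBOL source: the function name break_html_paragraph and the printed format are made up for this example, and treating an unterminated tag as a single item is an assumption about an edge case the page does not describe.

```bash
#!/usr/bin/env bash
# Illustration only: the real breakHTMLpara is a GnuCOBOL subprogram.
# break_html_paragraph is a hypothetical name used for this sketch.

# Print one line per item: "Tag <length>: <item>" or "Text <length>: <item>".
break_html_paragraph() {
    local buf=$1 item
    while [ -n "$buf" ]; do
        case $buf in
            [[:space:]]*)
                # Whitespace only separates items; drop one character.
                buf=${buf#?}
                ;;
            '<'*'>'*)
                # Tag item: everything up to and including the first '>'.
                item=${buf%%'>'*}'>'
                buf=${buf#"$item"}
                printf 'Tag %d: %s\n' "${#item}" "$item"
                ;;
            '<'*)
                # Assumption: an unterminated tag is returned as one item.
                item=$buf
                buf=''
                printf 'Tag %d: %s\n' "${#item}" "$item"
                ;;
            *)
                # Text item: runs to the next whitespace or the next '<'.
                item=${buf%%[[:space:]<]*}
                buf=${buf#"$item"}
                printf 'Text %d: %s\n' "${#item}" "$item"
                ;;
        esac
    done
}

break_html_paragraph '<h1>HTML Ipsum Presents</h1>'
# Tag 4: <h1>
# Text 4: HTML
# Text 5: Ipsum
# Text 8: Presents
# Tag 5: </h1>
```

The five items and their lengths agree with Record #000001 in the test output further down the page.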
To install and compile the source programs, download the bash script: downloads/breakHTMLparagraph.setup [md5: b4393fc006b8f0da58b7b5ecab465156]. Right-click the link and save the script in the location where you want the source programs to reside, then execute it. It will create two files containing the source programs, a copybook, and a sample HTML data file, then run the GnuCOBOL compiler to compile them into the run unit for the test program, and finally execute the test program. The GnuCOBOL compiler must be installed before you execute the bash script. This subprogram also calls my stringWords subprogram to handle the text blocks, so stringWords must be installed before you execute any program that calls breakHTMLpara.
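Before executing a downloaded script it is worth checking it against the published MD5 value. The following is a minimal sketch, assuming the script was saved as breakHTMLparagraph.setup in the current directory, that md5sum (GNU coreutils) is available, and that the GnuCOBOL compiler is on the PATH as cobc; verify_md5 is a helper name invented for this example.

```bash
#!/usr/bin/env bash
# Sketch only: verify_md5 is a hypothetical helper, not part of the download.

# Succeed only when the file's MD5 digest equals the expected value.
verify_md5() {
    local file=$1 expected=$2 actual
    actual=$(md5sum "$file" 2>/dev/null | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}

script=breakHTMLparagraph.setup

if [ ! -f "$script" ]; then
    echo "$script not found; download it first" >&2
elif ! command -v cobc >/dev/null 2>&1; then
    echo "GnuCOBOL compiler (cobc) not found; install it first" >&2
elif verify_md5 "$script" b4393fc006b8f0da58b7b5ecab465156; then
    # Digest matches the value published on this page: run the setup.
    bash "$script"
else
    echo "MD5 mismatch; not executing $script" >&2
fi
```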
I have found that some HTML files occasionally require me to expand either the number of elements in the returned table or the size of the 'word' returned (always because of excessively large tag strings found in eBooks). I have attempted to craft the routine so that it derives the necessary run-time information from the table passed by the caller, so it should be possible to increase the number of entries and/or the size of the entry 'word' simply by changing the copybook and recompiling the subprogram and the caller.
Output from the test/verification program:
jay@Phoenix ~$ ./testSubs
Record #000001 Return code: +000000000 Tag/Text Items found: +0005
Item #+0001 Tag Length: +0004: <h1>
Item #+0002 Text Length: +0004: HTML
Item #+0003 Text Length: +0005: Ipsum
Item #+0004 Text Length: +0008: Presents
Item #+0005 Tag Length: +0005: </h1>
Record #000003 Return code: +000000000 Tag/Text Items found: +0096
Item #+0001 Tag Length: +0003: <p>
Item #+0002 Tag Length: +0008: <strong>
Item #+0003 Text Length: +0012: Pellentesque
Item #+0004 Text Length: +0008: habitant
Item #+0005 Text Length: +0005: morbi
Item #+0006 Text Length: +0009: tristique
Item #+0007 Tag Length: +0009: </strong>
Item #+0008 Text Length: +0008: senectus
Item #+0009 Text Length: +0002: et
Item #+0010 Text Length: +0005: netus
Item #+0011 Text Length: +0002: et
Item #+0012 Text Length: +0009: malesuada
Item #+0013 Text Length: +0005: fames
Item #+0014 Text Length: +0002: ac
Item #+0015 Text Length: +0006: turpis
Item #+0016 Text Length: +0008: egestas.
Item #+0017 Text Length: +0010: Vestibulum
Item #+0018 Text Length: +0006: tortor
Item #+0019 Text Length: +0005: quam,
Item #+0020 Text Length: +0007: feugiat
Item #+0021 Text Length: +0006: vitae,
Item #+0022 Text Length: +0009: ultricies
Item #+0023 Text Length: +0005: eget,
Item #+0024 Text Length: +0006: tempor
Item #+0025 Text Length: +0003: sit
Item #+0026 Text Length: +0005: amet,
Item #+0027 Text Length: +0005: ante.
Item #+0028 Text Length: +0005: Donec
Item #+0029 Text Length: +0002: eu
Item #+0030 Text Length: +0006: libero
Item #+0031 Text Length: +0003: sit
Item #+0032 Text Length: +0004: amet
Item #+0033 Text Length: +0004: quam
Item #+0034 Text Length: +0007: egestas
Item #+0035 Text Length: +0007: semper.
Item #+0036 Tag Length: +0004: <em>
Item #+0037 Text Length: +0006: Aenean
Item #+0038 Text Length: +0009: ultricies
Item #+0039 Text Length: +0002: mi
Item #+0040 Text Length: +0005: vitae
Item #+0041 Text Length: +0004: est.
Item #+0042 Tag Length: +0005: </em>
Item #+0043 Text Length: +0006: Mauris
Item #+0044 Text Length: +0008: placerat
Item #+0045 Text Length: +0008: eleifend
Item #+0046 Text Length: +0004: leo.
Item #+0047 Text Length: +0007: Quisque
Item #+0048 Text Length: +0003: sit
Item #+0049 Text Length: +0004: amet
Item #+0050 Text Length: +0003: est
Item #+0051 Text Length: +0002: et
Item #+0052 Text Length: +0006: sapien
Item #+0053 Text Length: +0011: ullamcorper
Item #+0054 Text Length: +0009: pharetra.
Item #+0055 Text Length: +0010: Vestibulum
Item #+0056 Text Length: +0004: erat
Item #+0057 Text Length: +0005: wisi,
Item #+0058 Text Length: +0011: condimentum
Item #+0059 Text Length: +0004: sed,
Item #+0060 Tag Length: +0006: <code>
Item #+0061 Text Length: +0007: commodo
Item #+0062 Text Length: +0005: vitae
Item #+0063 Tag Length: +0007: </code>
Item #+0064 Text Length: +0001: ,
Item #+0065 Text Length: +0006: ornare
Item #+0066 Text Length: +0003: sit
Item #+0067 Text Length: +0005: amet,
Item #+0068 Text Length: +0005: wisi.
Item #+0069 Text Length: +0006: Aenean
Item #+0070 Text Length: +0010: fermentum,
Item #+0071 Text Length: +0004: elit
Item #+0072 Text Length: +0004: eget
Item #+0073 Text Length: +0009: tincidunt
Item #+0074 Text Length: +0012: condimentum,
Item #+0075 Text Length: +0004: eros
Item #+0076 Text Length: +0005: ipsum
Item #+0077 Text Length: +0006: rutrum
Item #+0078 Text Length: +0005: orci,
Item #+0079 Text Length: +0008: sagittis
Item #+0080 Text Length: +0006: tempus
Item #+0081 Text Length: +0005: lacus
Item #+0082 Text Length: +0004: enim
Item #+0083 Text Length: +0002: ac
Item #+0084 Text Length: +0004: dui.
Item #+0085 Tag Length: +0012: <a href="#">
Item #+0086 Text Length: +0005: Donec
Item #+0087 Text Length: +0003: non
Item #+0088 Text Length: +0004: enim
Item #+0089 Tag Length: +0004: </a>
Item #+0090 Text Length: +0002: in
Item #+0091 Text Length: +0006: turpis
Item #+0092 Text Length: +0008: pulvinar
Item #+0093 Text Length: +0010: facilisis.
Item #+0094 Text Length: +0002: Ut
Item #+0095 Text Length: +0006: felis.
Item #+0096 Tag Length: +0004: </p>
Record #000005 Return code: +000000000 Tag/Text Items found: +0005
Item #+0001 Tag Length: +0004: <h2>
Item #+0002 Text Length: +0006: Header
Item #+0003 Text Length: +0005: Level
Item #+0004 Text Length: +0001: 2
Item #+0005 Tag Length: +0005: </h2>
Record #000007 Return code: +000000000 Tag/Text Items found: +0001
Item #+0001 Tag Length: +0004: <ol>
Record #000008 Return code: +000000000 Tag/Text Items found: +0010
Item #+0001 Tag Length: +0004: <li>
Item #+0002 Text Length: +0005: Lorem
Item #+0003 Text Length: +0005: ipsum
Item #+0004 Text Length: +0005: dolor
Item #+0005 Text Length: +0003: sit
Item #+0006 Text Length: +0005: amet,
Item #+0007 Text Length: +0012: consectetuer
Item #+0008 Text Length: +0010: adipiscing
Item #+0009 Text Length: +0005: elit.
Item #+0010 Tag Length: +0005: </li>
Record #000009 Return code: +000000000 Tag/Text Items found: +0007
Item #+0001 Tag Length: +0004: <li>
Item #+0002 Text Length: +0007: Aliquam
Item #+0003 Text Length: +0009: tincidunt
Item #+0004 Text Length: +0006: mauris
Item #+0005 Text Length: +0002: eu
Item #+0006 Text Length: +0006: risus.
Item #+0007 Tag Length: +0005: </li>
Record #000010 Return code: +000000000 Tag/Text Items found: +0001
Item #+0001 Tag Length: +0005: </ol>
Record #000012 Return code: +000000000 Tag/Text Items found: +0051
Item #+0001 Tag Length: +0012: <blockquote>
Item #+0002 Tag Length: +0003: <p>
Item #+0003 Text Length: +0005: Lorem
Item #+0004 Text Length: +0005: ipsum
Item #+0005 Text Length: +0005: dolor
Item #+0006 Text Length: +0003: sit
Item #+0007 Text Length: +0005: amet,
Item #+0008 Text Length: +0011: consectetur
Item #+0009 Text Length: +0010: adipiscing
Item #+0010 Text Length: +0005: elit.
Item #+0011 Text Length: +0007: Vivamus
Item #+0012 Text Length: +0006: magna.
Item #+0013 Text Length: +0004: Cras
Item #+0014 Text Length: +0002: in
Item #+0015 Text Length: +0002: mi
Item #+0016 Text Length: +0002: at
Item #+0017 Text Length: +0005: felis
Item #+0018 Text Length: +0007: aliquet
Item #+0019 Text Length: +0007: congue.
Item #+0020 Text Length: +0002: Ut
Item #+0021 Text Length: +0001: a
Item #+0022 Text Length: +0003: est
Item #+0023 Text Length: +0004: eget
Item #+0024 Text Length: +0006: ligula
Item #+0025 Text Length: +0008: molestie
Item #+0026 Text Length: +0008: gravida.
Item #+0027 Text Length: +0009: Curabitur
Item #+0028 Text Length: +0006: massa.
Item #+0029 Text Length: +0005: Donec
Item #+0030 Text Length: +0009: eleifend,
Item #+0031 Text Length: +0006: libero
Item #+0032 Text Length: +0002: at
Item #+0033 Text Length: +0008: sagittis
Item #+0034 Text Length: +0007: mollis,
Item #+0035 Text Length: +0006: tellus
Item #+0036 Text Length: +0003: est
Item #+0037 Text Length: +0009: malesuada
Item #+0038 Text Length: +0007: tellus,
Item #+0039 Text Length: +0002: at
Item #+0040 Text Length: +0006: luctus
Item #+0041 Text Length: +0006: turpis
Item #+0042 Text Length: +0004: elit
Item #+0043 Text Length: +0003: sit
Item #+0044 Text Length: +0004: amet
Item #+0045 Text Length: +0005: quam.
Item #+0046 Text Length: +0007: Vivamus
Item #+0047 Text Length: +0007: pretium
Item #+0048 Text Length: +0006: ornare
Item #+0049 Text Length: +0004: est.
Item #+0050 Tag Length: +0004: </p>
Item #+0051 Tag Length: +0013: </blockquote>
Record #000014 Return code: +000000000 Tag/Text Items found: +0005
Item #+0001 Tag Length: +0004: <h3>
Item #+0002 Text Length: +0006: Header
Item #+0003 Text Length: +0005: Level
Item #+0004 Text Length: +0001: 3
Item #+0005 Tag Length: +0005: </h3>
Record #000016 Return code: +000000000 Tag/Text Items found: +0001
Item #+0001 Tag Length: +0004: <ul>
Record #000017 Return code: +000000000 Tag/Text Items found: +0010
Item #+0001 Tag Length: +0004: <li>
Item #+0002 Text Length: +0005: Lorem
Item #+0003 Text Length: +0005: ipsum
Item #+0004 Text Length: +0005: dolor
Item #+0005 Text Length: +0003: sit
Item #+0006 Text Length: +0005: amet,
Item #+0007 Text Length: +0012: consectetuer
Item #+0008 Text Length: +0010: adipiscing
Item #+0009 Text Length: +0005: elit.
Item #+0010 Tag Length: +0005: </li>
Record #000018 Return code: +000000000 Tag/Text Items found: +0007
Item #+0001 Tag Length: +0004: <li>
Item #+0002 Text Length: +0007: Aliquam
Item #+0003 Text Length: +0009: tincidunt
Item #+0004 Text Length: +0006: mauris
Item #+0005 Text Length: +0002: eu
Item #+0006 Text Length: +0006: risus.
Item #+0007 Tag Length: +0005: </li>
Record #000019 Return code: +000000000 Tag/Text Items found: +0001
Item #+0001 Tag Length: +0005: </ul>
Record #000020 Return code: +000000000 Tag/Text Items found: +0002
Item #+0001 Tag Length: +0007: </code>
Item #+0002 Tag Length: +0006: </pre>
jay@Phoenix ~$
If you want to be able to call the subprogram dynamically, move the object module (breakHTMLpara.so) to a location included in your COB_LIBRARY_PATH.
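One way to do that from a shell is sketched below. The function name install_cobol_module and the directory $HOME/cobol/lib are assumptions made for this example; any directory will do as long as it ends up in COB_LIBRARY_PATH, which on Linux is a colon-separated list of directories.

```bash
#!/usr/bin/env bash
# Sketch only: install_cobol_module is a hypothetical helper name.

# Move a compiled module into lib_dir and put lib_dir at the front of
# COB_LIBRARY_PATH, keeping any directories that were already listed.
install_cobol_module() {
    local module=$1 lib_dir=$2
    mkdir -p "$lib_dir" || return 1
    mv "$module" "$lib_dir"/ || return 1
    export COB_LIBRARY_PATH="$lib_dir${COB_LIBRARY_PATH:+:$COB_LIBRARY_PATH}"
}

# Example call (hypothetical directory), run from where the module was built:
# install_cobol_module breakHTMLpara.so "$HOME/cobol/lib"
```

The export only lasts for the current shell session, so define and call the function in that session (or source the file), and put the export line in your shell start-up file (for example ~/.bashrc) if you want the setting to persist.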
This page was last updated on April 06, 2021.