Break HTML Paragraph into tag/text Fields

 

This is another of the subprograms I wrote as I worked through a group of files containing HTML code.  Pass this subprogram a buffer containing a unit of HTML code (I have been calling it a paragraph, because what I am working on will ultimately be viewed as electronic books) and it will return a table where each entry is either an HTML tag item (everything from a < through the matching >, including any attributes) or a text item (a single word, bounded by spaces or by the adjacent tags).
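For readers who want to see the splitting rule in isolation, here is a minimal Python sketch of the behavior described above. It is not the COBOL source and makes no claim about how breakHTMLpara is implemented internally. The function name break_html_paragraph and the (kind, text) tuple layout are hypothetical, chosen only for illustration, and the handling of an unterminated tag is an assumption.

```python
def break_html_paragraph(buffer):
    """Split one HTML 'paragraph' into ("Tag", ...) and ("Text", ...) items.

    Illustrative sketch only: the name and return layout are hypothetical,
    not taken from the COBOL subprogram.
    """
    items = []
    pos = 0
    size = len(buffer)
    while pos < size:
        if buffer[pos] == "<":
            # A tag item runs from "<" through the next ">", spaces included.
            end = buffer.find(">", pos)
            if end == -1:
                # Assumption: an unterminated tag takes the rest of the buffer.
                end = size - 1
            items.append(("Tag", buffer[pos:end + 1]))
            pos = end + 1
        else:
            # A text run ends at the next "<" (or at the end of the buffer);
            # each whitespace-separated word in it becomes its own item.
            nxt = buffer.find("<", pos)
            if nxt == -1:
                nxt = size
            for word in buffer[pos:nxt].split():
                items.append(("Text", word))
            pos = nxt
    return items


if __name__ == "__main__":
    for number, (kind, text) in enumerate(
            break_html_paragraph("<h1>HTML Ipsum Presents</h1>"), start=1):
        print(f"Item #{number:04d} {kind:<5} Length: {len(text):04d}: {text}")
```

Run against the first sample record, the sketch yields the same five items (one tag, three words, one tag) that appear in the test program's output below.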

To install the source programs and compile them, download and execute the bash script:  downloads/breakHTMLparagraph.setup  [md5: 6df054d6842b3f6a434a348ccfcf41b0].  Right click and save the script in the location where you want the source programs to reside, then execute it.  It will create two files containing the source programs, plus a copybook and a sample HTML data file, then run the GnuCOBOL compiler to build the run unit for the test program.  The GnuCOBOL compiler must be installed before you execute the bash script.  Also, this subprogram calls my stringWords subprogram to handle the text blocks, so you will need to have that installed before running any program that calls breakHTMLpara.

I have found that I occasionally have some HTML files where I need to expand either the number of elements in the returned table or the size of the 'word' returned (always because of excessively large tag strings found in eBooks).  I have attempted to craft the routine so that it derives the necessary run-time information from the table passed by the caller, so it should be possible to increase the number of entries and/or the size of the entry 'word' simply by changing the copybook and recompiling the subprogram and caller.

Output from the test/verification program:

jay@Phoenix ~$ ./testSubs 
Record #000001
Return code: +000000000
Tag/Text Items found: +0005
Item #+0001 Tag   Length: +0004: <h1>                                              
Item #+0002 Text  Length: +0004: HTML                                              
Item #+0003 Text  Length: +0005: Ipsum                                             
Item #+0004 Text  Length: +0008: Presents                                          
Item #+0005 Tag   Length: +0005: </h1>                                             
 
Record #000003
Return code: +000000000
Tag/Text Items found: +0096
Item #+0001 Tag   Length: +0003: <p>                                               
Item #+0002 Tag   Length: +0008: <strong>                                          
Item #+0003 Text  Length: +0012: Pellentesque                                      
Item #+0004 Text  Length: +0008: habitant                                          
Item #+0005 Text  Length: +0005: morbi                                             
Item #+0006 Text  Length: +0009: tristique                                         
Item #+0007 Tag   Length: +0009: </strong>                                         
Item #+0008 Text  Length: +0008: senectus                                          
Item #+0009 Text  Length: +0002: et                                                
Item #+0010 Text  Length: +0005: netus                                             
Item #+0011 Text  Length: +0002: et                                                
Item #+0012 Text  Length: +0009: malesuada                                         
Item #+0013 Text  Length: +0005: fames                                             
Item #+0014 Text  Length: +0002: ac                                                
Item #+0015 Text  Length: +0006: turpis                                            
Item #+0016 Text  Length: +0008: egestas.                                          
Item #+0017 Text  Length: +0010: Vestibulum                                        
Item #+0018 Text  Length: +0006: tortor                                            
Item #+0019 Text  Length: +0005: quam,                                             
Item #+0020 Text  Length: +0007: feugiat                                           
Item #+0021 Text  Length: +0006: vitae,                                            
Item #+0022 Text  Length: +0009: ultricies                                         
Item #+0023 Text  Length: +0005: eget,                                             
Item #+0024 Text  Length: +0006: tempor                                            
Item #+0025 Text  Length: +0003: sit                                               
Item #+0026 Text  Length: +0005: amet,                                             
Item #+0027 Text  Length: +0005: ante.                                             
Item #+0028 Text  Length: +0005: Donec                                             
Item #+0029 Text  Length: +0002: eu                                                
Item #+0030 Text  Length: +0006: libero                                            
Item #+0031 Text  Length: +0003: sit                                               
Item #+0032 Text  Length: +0004: amet                                              
Item #+0033 Text  Length: +0004: quam                                              
Item #+0034 Text  Length: +0007: egestas                                           
Item #+0035 Text  Length: +0007: semper.                                           
Item #+0036 Tag   Length: +0004: <em>                                              
Item #+0037 Text  Length: +0006: Aenean                                            
Item #+0038 Text  Length: +0009: ultricies                                         
Item #+0039 Text  Length: +0002: mi                                                
Item #+0040 Text  Length: +0005: vitae                                             
Item #+0041 Text  Length: +0004: est.                                              
Item #+0042 Tag   Length: +0005: </em>                                             
Item #+0043 Text  Length: +0006: Mauris                                            
Item #+0044 Text  Length: +0008: placerat                                          
Item #+0045 Text  Length: +0008: eleifend                                          
Item #+0046 Text  Length: +0004: leo.                                              
Item #+0047 Text  Length: +0007: Quisque                                           
Item #+0048 Text  Length: +0003: sit                                               
Item #+0049 Text  Length: +0004: amet                                              
Item #+0050 Text  Length: +0003: est                                               
Item #+0051 Text  Length: +0002: et                                                
Item #+0052 Text  Length: +0006: sapien                                            
Item #+0053 Text  Length: +0011: ullamcorper                                       
Item #+0054 Text  Length: +0009: pharetra.                                         
Item #+0055 Text  Length: +0010: Vestibulum                                        
Item #+0056 Text  Length: +0004: erat                                              
Item #+0057 Text  Length: +0005: wisi,                                             
Item #+0058 Text  Length: +0011: condimentum                                       
Item #+0059 Text  Length: +0004: sed,                                              
Item #+0060 Tag   Length: +0006: <code>                                            
Item #+0061 Text  Length: +0007: commodo                                           
Item #+0062 Text  Length: +0005: vitae                                             
Item #+0063 Tag   Length: +0007: </code>                                           
Item #+0064 Text  Length: +0001: ,                                                 
Item #+0065 Text  Length: +0006: ornare                                            
Item #+0066 Text  Length: +0003: sit                                               
Item #+0067 Text  Length: +0005: amet,                                             
Item #+0068 Text  Length: +0005: wisi.                                             
Item #+0069 Text  Length: +0006: Aenean                                            
Item #+0070 Text  Length: +0010: fermentum,                                        
Item #+0071 Text  Length: +0004: elit                                              
Item #+0072 Text  Length: +0004: eget                                              
Item #+0073 Text  Length: +0009: tincidunt                                         
Item #+0074 Text  Length: +0012: condimentum,                                      
Item #+0075 Text  Length: +0004: eros                                              
Item #+0076 Text  Length: +0005: ipsum                                             
Item #+0077 Text  Length: +0006: rutrum                                            
Item #+0078 Text  Length: +0005: orci,                                             
Item #+0079 Text  Length: +0008: sagittis                                          
Item #+0080 Text  Length: +0006: tempus                                            
Item #+0081 Text  Length: +0005: lacus                                             
Item #+0082 Text  Length: +0004: enim                                              
Item #+0083 Text  Length: +0002: ac                                                
Item #+0084 Text  Length: +0004: dui.                                              
Item #+0085 Tag   Length: +0012: <a href="#">                                      
Item #+0086 Text  Length: +0005: Donec                                             
Item #+0087 Text  Length: +0003: non                                               
Item #+0088 Text  Length: +0004: enim                                              
Item #+0089 Tag   Length: +0004: </a>                                              
Item #+0090 Text  Length: +0002: in                                                
Item #+0091 Text  Length: +0006: turpis                                            
Item #+0092 Text  Length: +0008: pulvinar                                          
Item #+0093 Text  Length: +0010: facilisis.                                        
Item #+0094 Text  Length: +0002: Ut                                                
Item #+0095 Text  Length: +0006: felis.                                            
Item #+0096 Tag   Length: +0004: </p>                                              
 
Record #000005
Return code: +000000000
Tag/Text Items found: +0005
Item #+0001 Tag   Length: +0004: <h2>                                              
Item #+0002 Text  Length: +0006: Header                                            
Item #+0003 Text  Length: +0005: Level                                             
Item #+0004 Text  Length: +0001: 2                                                 
Item #+0005 Tag   Length: +0005: </h2>                                             
 
Record #000007
Return code: +000000000
Tag/Text Items found: +0001
Item #+0001 Tag   Length: +0004: <ol>                                              
 
Record #000008
Return code: +000000000
Tag/Text Items found: +0010
Item #+0001 Tag   Length: +0004: <li>                                              
Item #+0002 Text  Length: +0005: Lorem                                             
Item #+0003 Text  Length: +0005: ipsum                                             
Item #+0004 Text  Length: +0005: dolor                                             
Item #+0005 Text  Length: +0003: sit                                               
Item #+0006 Text  Length: +0005: amet,                                             
Item #+0007 Text  Length: +0012: consectetuer                                      
Item #+0008 Text  Length: +0010: adipiscing                                        
Item #+0009 Text  Length: +0005: elit.                                             
Item #+0010 Tag   Length: +0005: </li>                                             
 
Record #000009
Return code: +000000000
Tag/Text Items found: +0007
Item #+0001 Tag   Length: +0004: <li>                                              
Item #+0002 Text  Length: +0007: Aliquam                                           
Item #+0003 Text  Length: +0009: tincidunt                                         
Item #+0004 Text  Length: +0006: mauris                                            
Item #+0005 Text  Length: +0002: eu                                                
Item #+0006 Text  Length: +0006: risus.                                            
Item #+0007 Tag   Length: +0005: </li>                                             
 
Record #000010
Return code: +000000000
Tag/Text Items found: +0001
Item #+0001 Tag   Length: +0005: </ol>                                             
 
Record #000012
Return code: +000000000
Tag/Text Items found: +0051
Item #+0001 Tag   Length: +0012: <blockquote>                                      
Item #+0002 Tag   Length: +0003: <p>                                               
Item #+0003 Text  Length: +0005: Lorem                                             
Item #+0004 Text  Length: +0005: ipsum                                             
Item #+0005 Text  Length: +0005: dolor                                             
Item #+0006 Text  Length: +0003: sit                                               
Item #+0007 Text  Length: +0005: amet,                                             
Item #+0008 Text  Length: +0011: consectetur                                       
Item #+0009 Text  Length: +0010: adipiscing                                        
Item #+0010 Text  Length: +0005: elit.                                             
Item #+0011 Text  Length: +0007: Vivamus                                           
Item #+0012 Text  Length: +0006: magna.                                            
Item #+0013 Text  Length: +0004: Cras                                              
Item #+0014 Text  Length: +0002: in                                                
Item #+0015 Text  Length: +0002: mi                                                
Item #+0016 Text  Length: +0002: at                                                
Item #+0017 Text  Length: +0005: felis                                             
Item #+0018 Text  Length: +0007: aliquet                                           
Item #+0019 Text  Length: +0007: congue.                                           
Item #+0020 Text  Length: +0002: Ut                                                
Item #+0021 Text  Length: +0001: a                                                 
Item #+0022 Text  Length: +0003: est                                               
Item #+0023 Text  Length: +0004: eget                                              
Item #+0024 Text  Length: +0006: ligula                                            
Item #+0025 Text  Length: +0008: molestie                                          
Item #+0026 Text  Length: +0008: gravida.                                          
Item #+0027 Text  Length: +0009: Curabitur                                         
Item #+0028 Text  Length: +0006: massa.                                            
Item #+0029 Text  Length: +0005: Donec                                             
Item #+0030 Text  Length: +0009: eleifend,                                         
Item #+0031 Text  Length: +0006: libero                                            
Item #+0032 Text  Length: +0002: at                                                
Item #+0033 Text  Length: +0008: sagittis                                          
Item #+0034 Text  Length: +0007: mollis,                                           
Item #+0035 Text  Length: +0006: tellus                                            
Item #+0036 Text  Length: +0003: est                                               
Item #+0037 Text  Length: +0009: malesuada                                         
Item #+0038 Text  Length: +0007: tellus,                                           
Item #+0039 Text  Length: +0002: at                                                
Item #+0040 Text  Length: +0006: luctus                                            
Item #+0041 Text  Length: +0006: turpis                                            
Item #+0042 Text  Length: +0004: elit                                              
Item #+0043 Text  Length: +0003: sit                                               
Item #+0044 Text  Length: +0004: amet                                              
Item #+0045 Text  Length: +0005: quam.                                             
Item #+0046 Text  Length: +0007: Vivamus                                           
Item #+0047 Text  Length: +0007: pretium                                           
Item #+0048 Text  Length: +0006: ornare                                            
Item #+0049 Text  Length: +0004: est.                                              
Item #+0050 Tag   Length: +0004: </p>                                              
Item #+0051 Tag   Length: +0013: </blockquote>                                     
 
Record #000014
Return code: +000000000
Tag/Text Items found: +0005
Item #+0001 Tag   Length: +0004: <h3>                                              
Item #+0002 Text  Length: +0006: Header                                            
Item #+0003 Text  Length: +0005: Level                                             
Item #+0004 Text  Length: +0001: 3                                                 
Item #+0005 Tag   Length: +0005: </h3>                                             
 
Record #000016
Return code: +000000000
Tag/Text Items found: +0001
Item #+0001 Tag   Length: +0004: <ul>                                              
 
Record #000017
Return code: +000000000
Tag/Text Items found: +0010
Item #+0001 Tag   Length: +0004: <li>                                              
Item #+0002 Text  Length: +0005: Lorem                                             
Item #+0003 Text  Length: +0005: ipsum                                             
Item #+0004 Text  Length: +0005: dolor                                             
Item #+0005 Text  Length: +0003: sit                                               
Item #+0006 Text  Length: +0005: amet,                                             
Item #+0007 Text  Length: +0012: consectetuer                                      
Item #+0008 Text  Length: +0010: adipiscing                                        
Item #+0009 Text  Length: +0005: elit.                                             
Item #+0010 Tag   Length: +0005: </li>                                             
 
Record #000018
Return code: +000000000
Tag/Text Items found: +0007
Item #+0001 Tag   Length: +0004: <li>                                              
Item #+0002 Text  Length: +0007: Aliquam                                           
Item #+0003 Text  Length: +0009: tincidunt                                         
Item #+0004 Text  Length: +0006: mauris                                            
Item #+0005 Text  Length: +0002: eu                                                
Item #+0006 Text  Length: +0006: risus.                                            
Item #+0007 Tag   Length: +0005: </li>                                             
 
Record #000019
Return code: +000000000
Tag/Text Items found: +0001
Item #+0001 Tag   Length: +0005: </ul>                                             
 
Record #000020
Return code: +000000000
Tag/Text Items found: +0002
Item #+0001 Tag   Length: +0007: </code>                                           
Item #+0002 Tag   Length: +0006: </pre>                                            
 
jay@Phoenix ~$

If you want to be able to call the subprogram dynamically, move the object module (breakHTMLpara.so) to a location included in your COB_LIBRARY_PATH.


This page was last updated on November 10, 2019.