Re: html->ascii

Frank Ziegler (zief@afsmail.cern.ch)
Mon, 27 Jan 1997 17:55:47 +0100 (MET)


On Mon, 27 Jan 1997, Fons Rademakers wrote:

> I am sure there must be standard programs out there that
> do exactly that: stripping HTML out of a file. Probably
> some small Perl script will do. Does anybody have such a script?
>
> Cheers, Fons.
>
>

I found a script that converts HTML to ASCII. It does not appear to be
copyrighted :-)

Frank

_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/
_/_/_/ eMail: Frank.Ziegler@cern.ch _/_/_/
_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/

-------- start ---------

#!/bin/sh
#
# Name: html2ascii
#
# Description:
# Convert HTML files to ASCII. Strips trailing blanks so that
# blank lines come out truly empty.
#
# Exit Values:
# 0 Successful completion.
# 1 Failure (lynx could not process a file, or the script was terminated)
#
# Inputs:
# 0 or more files (stdin is read for 0 files)
#
# Outputs:
#
# Environment:
# DEBUG is a colon-separated list of scripts to debug (or all, true, on)
# DEBUG_LOG is the log file for debugging output
#
# Modified by:
# 11/01/95 Howard Fear Original
#
# Notes:
# 01/01/96 Howard Fear Change LYNX to be correct on your system.
#
LYNX=/usr/local/bin/lynx

set -h # remember functions
self="`basename $0`"
die () { echo ${1:+"$self: $*"} >&2; kill -TERM $$; }
warn () { echo ${1:+"$self: $*"} >&2; }
usage () { echo "usage: $self files..." >&2; exit 1; }
: ${TMPDIR:=/tmp}

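# An "exit 1" inside the exit trap (signal 0) would override the final
# "exit 0", so only the TERM trap forces a failure status.
trap 'rm -f core' 0
trap 'rm -f core; exit 1' 15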
case ":${DEBUG:=$debug}:" in
    *:$self:* | :all: | :true: | :on: )
        set -vx
        test "$DEBUG_LOG" && exec 2>>"$DEBUG_LOG"
        ;;
esac

# use stdin if no files present
test $# -eq 0 && set -- -
for i
do
    case "$i" in
    - ) # process stdin
        # lynx occasionally core dumps so we remove core as well
        trap "rm -f $TMPDIR/$$.html core" 0
        # die() sends TERM to this shell, so the TERM trap must exit
        trap "rm -f $TMPDIR/$$.html core; exit 1" 15
        # copy stdin verbatim; a read/echo loop would mangle
        # backslashes and leading whitespace
        cat > "$TMPDIR/$$.html" || {
            warn "Could not create $TMPDIR/$$.html"
            continue
        }
        filename="$TMPDIR/$$.html"
        ;;
    * ) filename="$i" ;;
    esac
    ( $LYNX -dump "$filename" || die "Could not convert file $filename" ) \
        | sed -e 's/[ ]*$//'
    echo
done

exit 0

------ snip -----
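
The sed stage at the end of the pipeline is what "makes empty lines
empty". A minimal sketch of that step on its own, so it can be tried
without lynx. The function name strip_trailing is hypothetical, and the
[[:space:]] class is an assumption: the script as posted has only a
space inside the brackets, while this version also removes tabs.

```shell
#!/bin/sh
# Strip trailing whitespace from every line read on stdin.
# Lines that contain only blanks become truly empty.
strip_trailing () {
    sed -e 's/[[:space:]]*$//'
}

# Demo: three lines, each with trailing spaces or a tab.
printf 'Title   \n   \nBody\t\n' | strip_trailing
```

In the script itself the input of this stage would be the output of
"$LYNX -dump", rather than the printf used here.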