All times are UTC - 5 hours [ DST ]




 Post subject: Electronic Document Storage and Indexing
PostPosted: Tue Oct 10, 2006 10:25 pm 

Joined: Thu Jun 16, 2005 11:54 am
Posts: 609
Would a few of the participants who are involved with organizations that maintain libraries or archives be interested in offering their input on this "hypothetical" question?

Suppose that you were about to convert an enormous collection of technical documents to electronic format. Do not bother with copyright issues for this one; we have been over that many times here and it does not affect this particular situation. These documents all have value of their own as artifacts, but the technical information they contain has a vastly greater value, and you want them in a format that will be readily accessible and usable for some time into the future.

Would PDF be your primary choice of document format, particularly if you have it readily available on computer equipment coupled with a good scanner, or are there any other choices that are better?

What would be your choice of database program to list all these documents and organize an index? If you have Microsoft Access available, would that be your choice, or can you think of something else that offers any major advantages?

Oh, by the way, let's include a time frame. This project may take ten years to complete and you would like the formats and programs you select to be usable or at least convertible for some time beyond that.

Thanks in advance for your thoughts on this.

MX (not my real name)

_________________
"We Repair No Locomotive Before Its Time"


 Post subject: Re: Electronic Document Storage and Indexing
PostPosted: Wed Oct 11, 2006 6:08 am 

Joined: Thu Aug 26, 2004 2:53 pm
Posts: 660
mxdata wrote:
Suppose that you were about to convert an enormous collection of technical documents to a format that will be readily accessible and usable for some time into the future.

Would PDF be your primary choice of document format, particularly if you have it readily available on computer equipment coupled with a good scanner, or are there any other choices that are better?

What would be your choice of database program to list all these documents and organize an index? If you have Microsoft Access available, would that be your choice, or can you think of something else that offers any major advantages?

This project may take ten years to complete and you would like the formats and programs you select to be usable or at least convertible for some time beyond that.


If you want readily accessible and future-proof, PDF is probably not the optimum solution, depending on how you define readily accessible. To you, readily accessible might mean "I can print a copy of the original document when I want to." For the purpose of this long-winded diatribe, I'll define it pretty broadly - not only printing a facsimile of the original document, but being able to share the knowledge contained within those originals in different ways, in smaller "chunks."

If you are converting drawings (parts specifications), I'm not aware of a clean "import to CAD" tool - the most cost-effective method of converting these would probably be to hire a college student proficient in a CAD program.

If you are converting documents that are text-based (manuals), I would not recommend converting them to PDF. PDF is a print-layout specification - "electronic paper" - and agnostic about the information within a document. So it is more or less a dead-end if you ever want to go beyond describing the visual characteristics of ("seeing") the content. (And there are plenty of tools that create PDFs on-the-fly if you need them.)

ASCII text, image files for illustrations, along with content and formatting information is probably the better choice for text-rich documents. I envision here a table record with fields identifying the document, the XML tag, the content (text or BLOB), style information, and order.
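
One way to picture the table record described above is the minimal sketch below. It uses Python's built-in sqlite3 module purely as a stand-in for whatever database is eventually chosen. The table name `fragments`, the column names, and the sample rows are all assumptions made up for illustration, not part of any existing system.

```python
import sqlite3

def build_fragment_store():
    """Create an in-memory table with one row per chunk of a document."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE fragments (
            document_id TEXT    NOT NULL,  -- which manual or bulletin (hypothetical IDs)
            seq         INTEGER NOT NULL,  -- order of the chunk within the document
            tag         TEXT    NOT NULL,  -- XML-style tag, e.g. 'title' or 'para'
            style       TEXT,              -- formatting hint, e.g. 'bold'
            content     BLOB,              -- text or binary image data
            PRIMARY KEY (document_id, seq)
        )
        """
    )
    return conn

def document_text(conn, document_id):
    """Return the non-image chunks of one document in reading order."""
    rows = conn.execute(
        "SELECT content FROM fragments "
        "WHERE document_id = ? AND tag != 'image' ORDER BY seq",
        (document_id,),
    )
    return [row[0] for row in rows]

conn = build_fragment_store()
conn.executemany(
    "INSERT INTO fragments VALUES (?, ?, ?, ?, ?)",
    [
        ("MANUAL-001", 2, "para", None, "Check the oil level daily."),
        ("MANUAL-001", 1, "title", "bold", "Lubrication"),
        ("MANUAL-001", 3, "image", None, b"placeholder image bytes"),
    ],
)
print(document_text(conn, "MANUAL-001"))
```

Because each chunk carries its own tag and order, the same rows could be reassembled as a full document or served up in smaller pieces.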

Database format: the MSAccess database structure is useful in "internal" environments, but it's not well-suited to making information available over intranets or the internet. It's also a poor choice for managing "BLOBs" ("Binary Large OBjects," like documents or CAD files) - for some reason they will be 5X-10X larger than "native" when stored in Access, making your database much larger. Finally, Microsoft favors SQL Server over Jet/DAO (Access), and has been converting its product offerings (for example Retail Management System, which is based on QuickSell (which used Access)) pretty aggressively. Access will be around as a product in 10 years, but the database structure will change pretty drastically.

On the other hand, MSAccess is great for cost-effective front-end development to e.g. SQL Server or MySQL database structures, which do a much better job of storing BLOBs. And if the data is stored in one of these formats, you can also make it available on the Internet.

However, there are numerous off-the-rack document management solutions, from the high end ("Documentum," now owned by EMC) on down. Use one of these and your database choice is made for you.

I suggest striking up a relationship with your State Archives, for two reasons. First, they will be a great source of free consulting on the best way to handle the conversion (since they do it every day). Second, they will probably know of foundations that give grants for conversion projects. Also, the Society of American Archivists would be a good resource:

http://www.archivists.org/catalog/pubDe ... jectID=284

JAC


 Post subject: Re: Electronic Document Storage and Indexing
PostPosted: Wed Oct 11, 2006 7:46 am 

Joined: Sun Aug 22, 2004 5:15 am
Posts: 718
Location: Illinois
I am regrettably coming to the conclusion that electronic storage of documents and drawings may be too perishable to be of use to our community. I have several drawings that are of very limited use, having been obsoleted three times in the last ten years through buyouts or mergers of the software, or upgrades of continuing systems.

.PDF files seem to have some staying power but even such stalwarts as AUTOCAD use new software versions which do not always easily or correctly load drawings made in early versions.

And the hardware platforms change to where early software may not load should my old IBM 386 fail. Anybody here old enough to remember LOTUS, GENERIC CAD, WORDPERFECT, and other DOS programs?

Perhaps one solution would be to avoid proprietary file formats, since almost all software seems to change, and to try to save only in public domain formats such as .tiff and .jpeg image files. But those are not easy to massage in database format. Ten years is a very long time in the electronic media, while merely a blink of an eye in our 'business'.

Bob Kutella


 Post subject: Re: Electronic Document Storage and Indexing
PostPosted: Wed Oct 11, 2006 7:58 am 

Joined: Sat Apr 01, 2006 5:19 pm
Posts: 569
Location: Bowie, MD
IMHO, consider electronic documents to be temporary and approach their use as a matter of function and ease of access, not as an archive. You might get lucky and pick a data format that has lasting power, but long term storage is difficult, as storage technologies change very fast and electronic storage media are usually not "permanent" and are subject to failure.

Professionally, I've had to retrieve data from old tapes (remember the recent to-do about the lost original tapes from the Moon landing ... kicked off because someone realized they were about to retire the last machine capable of reading the tapes). I've had to shop eBay for ancient disk drives and then find or recreate software to read them, only to find out the media had gone bad.

One government agency I've worked for that archives technical data has a final backup of several million documents on old-fashioned microfiche, well understood to last 70+ years if kept in the right conditions, but also keeps most of the collection stored electronically for automated retrieval and access.


 Post subject: Re: Electronic Document Storage and Indexing
PostPosted: Wed Oct 11, 2006 8:28 am 

Joined: Mon Aug 23, 2004 8:10 am
Posts: 2499
There have been some good points made in this thread. I'll add some thoughts, as this is a top area of concern in my day to day job running a digital media group.

The need you describe is typically called by its acronym: DAM - or digital asset management. Good DAM principles will cover any kind of digital format (video, print, photo, etc...)

There are two key principles: formatting the asset for long term storage and assigning it meaningful metadata.

Formatting is a very difficult decision. For print, which is the topic of the original question, there is no doubt in my mind you should archive 2 versions: PDF and raw text in UTF-8. The former will give you a printable facsimile, while the latter will give you an easily searchable text format that will have very few forward compatibility issues.

Also, do yourself a favor and don't skimp on file size. Storage is cheap these days. Save high quality master files.

Metadata is all of the descriptive text that goes with the asset. It is incredibly important. Whatever database you keep it in, export and back up an XML version which will give you a forward compatible file.
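
As a rough illustration of the XML backup idea, the sketch below writes a few metadata records out as UTF-8 XML using only Python's standard library. The field names (`title`, `year`, `file`) and the sample records are hypothetical. It also assumes every field name is a valid XML element name and that Python 3.8 or later is available (needed for the `xml_declaration` argument).

```python
import xml.etree.ElementTree as ET

def records_to_xml(records):
    """Serialize a list of metadata dicts to a UTF-8 XML byte string."""
    root = ET.Element("documents")
    for record in records:
        doc = ET.SubElement(root, "document")
        for field, value in record.items():
            # Each metadata field becomes a child element holding its value as text.
            ET.SubElement(doc, field).text = str(value)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)

# Hypothetical sample records standing in for rows exported from the database.
sample = [
    {"title": "Air brake schematic", "year": 1948, "file": "scan-0001.pdf"},
    {"title": "Lubrication bulletin", "year": 1952, "file": "scan-0002.pdf"},
]
xml_bytes = records_to_xml(sample)
print(xml_bytes.decode("utf-8"))
```

The result is plain text, so it can be read back by almost any tool even if the original database program is no longer around.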

Finally, stay away from Access. You want a robust database that keeps the asset and the metadata together in one easy-to-search db. MySQL is probably your best choice, balancing cost and usability.

Rob


 Post subject: Re: Electronic Document Storage and Indexing
PostPosted: Wed Oct 11, 2006 9:27 am 

Joined: Mon Aug 23, 2004 3:01 pm
Posts: 1731
Location: SouthEast Pennsylvania
Speaking as a user, I think that the words should be stored as text. PDF seems to be a picture of the text, which is fine for preserving the layout, but hard to electronically search for a word. I don't have a strong opinion about what to do with pictures, diagrams, and tables, where the layout or position of the words is important.


 Post subject: Re: Electronic Document Storage and Indexing
PostPosted: Wed Oct 11, 2006 9:30 am 

Joined: Thu Jun 16, 2005 11:54 am
Posts: 609
Thanks for your excellent and helpful suggestions on this. I have been making both PDF scan files (for easy access) and TIFF files (for higher quality images), so it would appear that I am not too far off the mark. I am going to investigate the other suggestions further. The indexing is a project I have not started yet and your views on this are very valuable. I have been using a file naming system that provides its own sort of indexing in this format:

(manufacturer)-(equipment)-(model)-(description)-(notes)-(part or bulletin number)-(version)

This generates a file name such as:

MFR-LOCOMOTIVE-MOD-SCHEMATIC-NONDB-0000000-1

This system does a pretty good job of organizing itself when viewed in Windows Explorer, but in the long term there is an obvious need for something that can go to greater detail.
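
A naming scheme like the one above could later be turned into index fields automatically. The sketch below is only one possible approach: the `DocName` type and `parse_doc_name` function are hypothetical names, and it assumes that no field contains a hyphen and that any file extension is the text after the last period.

```python
from typing import NamedTuple

class DocName(NamedTuple):
    manufacturer: str
    equipment: str
    model: str
    description: str
    notes: str
    number: str   # part or bulletin number
    version: str

def parse_doc_name(filename):
    """Split a name like MFR-LOCOMOTIVE-MOD-SCHEMATIC-NONDB-0000000-1 into fields."""
    stem = filename.rsplit(".", 1)[0]  # drop a trailing extension, if any
    parts = stem.split("-")
    if len(parts) != 7:
        raise ValueError(
            f"expected 7 hyphen-separated fields, got {len(parts)}: {filename!r}"
        )
    return DocName(*parts)

name = parse_doc_name("MFR-LOCOMOTIVE-MOD-SCHEMATIC-NONDB-0000000-1.pdf")
print(name.manufacturer, name.description, name.version)
```

Names that do not fit the pattern raise an error instead of being silently mis-filed, which makes stray files easy to spot before they go into an index.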

Thanks again for your suggestions. If this topic is of interest to other participants in the forum I hope they will also offer their views.

MX (not my real name)

_________________
"We Repair No Locomotive Before Its Time"


 Post subject: Re: Electronic Document Storage and Indexing
PostPosted: Wed Oct 11, 2006 11:10 am 

Hi,
For documents that will pass on to others, PDF is a universal format and readily accessible; the reader from Adobe is free.
Files are a bit large at times, though.

If you have people who are willing to transcribe information, you can then use a text format for anything that is written. These files are highly compressible and are easy to move from place to place.
Transcribing will increase the time it takes to get the data online, and the results will have to be proofread for mistakes, all very time consuming.

An OCR (optical character recognition) program can handle a lot of the text, but you still have to proofread everything, and that is almost as time consuming as transcribing it.

Pictures, art and drawings can be in either JPEG format or as PDF.
The JPEG format is easier to work with and can be compressed to a smaller size for moving across the internet or putting on webpages.

Database programs:
There are many out there. All the Microsoft programs are good. Access and MS-SQL are both good, but they run only on Windows, and they have their drawbacks too, especially if you migrate to other database formats later on.
I personally like the MySQL database program because it runs on both Windows and Linux platforms, and migrating the data from Windows to Linux or back is very easy. If you use Linux for your database server, then the cost of the server and database is cut dramatically: Linux is free and the MySQL database server software is free. You can also tailor the database exactly the way you want to.

I wish you success in your efforts to move your data to electronic format.

Andrew Martin


  
 
 Post subject: Re: Electronic Document Storage and Indexing
PostPosted: Wed Oct 11, 2006 11:55 am 

Joined: Thu Apr 14, 2005 9:34 pm
Posts: 2762
Location: Copenhagen, Denmark
Filemaker Pro is the best choice for your project. It is available for both Macintosh and Windows, and has supported a web server option for years. It is used heavily by publishing and graphic design firms, and has excellent support for storage of pictures (binary large objects) and also has the ability to store a complete data file within the database record. You can actually select a disk file, and "paste" it into a record field, and later export the file from the field and return it to its original form on the disk.

I used it recently to document academic journal articles for a professor. I created a citation database and each record linked to a pdf of the original article, all burned to a CD.

I have also been a professional FileMaker developer for 10 years.

www.filemaker.com

_________________
Steven Harrod
Lektor
Danmarks Tekniske Universitet


 Post subject: PDF vs. Text is really PDF and Text
PostPosted: Wed Oct 11, 2006 10:41 pm 

Joined: Mon Aug 23, 2004 8:10 am
Posts: 2499
I agree Jim, that the plain text is the way to go for many reasons, but the PDF is equally important for preserving the layout of the document. This is why I encourage an archivist to do both.

Rob


 Post subject: Re: PDF vs. Text is really PDF and Text
PostPosted: Thu Oct 12, 2006 8:30 am 

Joined: Thu Jun 16, 2005 11:54 am
Posts: 609
Thanks again for all your excellent comments on this subject. Just as a matter of interest, I have found that when I make high resolution copies of documents it sometimes works well to post the images into PowerPoint as a means of printing and direct display, and that in turn allows both for high resolution printing that will work well with OCR and also for conversion to PDF files. However, sometimes the file sizes that result are rather large.

I noticed today that this topic fits well with several points of the RYPN "purpose" statement on the home page.

_________________
"We Repair No Locomotive Before Its Time"


 Post subject: Re: PDF vs. Text is really PDF and Text
PostPosted: Thu Oct 12, 2006 11:52 am 

Joined: Thu Apr 14, 2005 9:34 pm
Posts: 2762
Location: Copenhagen, Denmark
I think the pdf file format should not be slighted. It is currently the standard for all academic journals at the library, and is also now the required standard for all students to submit their theses and dissertations. Our library (University of Cincinnati), and most others, no longer accept paper dissertations. The state of Ohio has a very aggressive library coalition and collections policy that supports availability (not replacement) of rare archives by pdf through online access "OhioLink".

Technology also exists to "scan" pdf files and retrieve the text, so if this is later desired, it is possible.

A quality scan to pdf will reach the widest audience and be in compliance with current library standards.

There are also archival projects sponsored by libraries for maintaining archives of digital documents. They store backup copies of documents at various alternate locations for "lifetime" protection. I don't know if the format is optical disk, magnetic tape, or very large disk arrays, but I believe the media is constantly rotated and renewed.

If you scan these documents and a major library accepts them into its collection, you will have some comfort that they will maintain the files "forever". This is a safe risk because the cost of storing computer files is already very small and is continuing to decline. I would not be surprised if the whole UC library collection could be stored on a single 5 inch disk in 20 years (since we now have the ability to store 5 GB on a keychain).

I agree the quality of microfilm archives is still better than digital, but the cost of storing them and making them accessible is increasing, not decreasing, due to the physical space they occupy and the cost of shipping them in inter-library loan. A color scan of a page is pretty darn close to the functionality of a microfilm, but I admit to occasional frustration when trying to read digital scans sent to me by interlibrary loan (especially when trying to read mathematical equations with small accent marks or subscripts).

_________________
Steven Harrod
Lektor
Danmarks Tekniske Universitet


 Post subject: Re: Electronic Document Storage and Indexing
PostPosted: Fri Oct 13, 2006 8:43 am 

Joined: Thu Jan 26, 2006 11:08 am
Posts: 47
Location: Bonsal, NC, USA
The digital world is upon us, folks, and there is no denying it is an issue we all must face with our museums. The problem we have, however, is this is one of those subjects where "everybody is an expert" because they once used a digital camera to take a photo of Aunt Minnie, hopefully standing next to a working steam locomotive. *grin*

There have been some excellent suggestions and good information in this thread, but we seem to forget, as usual, that we are not alone here. This is an issue faced by many museums larger than any of ours, from the Museum of Modern Art in New York City to the British Museum in London, and one they are struggling with as well. Rather than simply bantering back and forth about formats and second-rate software packages, we should take a look at what they have found and are doing about digital asset management (DAM).

I took just about five minutes just now to see if I could find any starting point for references on the subject, and was able to find four good links to good information on the subject. They are:

Wikipedia:
http://en.wikipedia.org/wiki/Digital_Asset_Management

Journal of Digital Asset Management:
http://www.palgrave-journals.com/dam/index.html

NYU Guide to Digital Asset Management:
http://www.nyu.edu/its/humanities/ninchguide/XIII/

NC ECHO - Preservation Metadata for Digital Objects:
http://www.ncecho.org/presmet/pmdo.htm

I suggest anyone truly interested in properly preserving their collections with DAM should start by reading through those four webpages, and get a real idea of the scope of the project. This should never seem a daunting prospect, but if we are going to do it, we should do it correctly.

If I may be allowed one paragraph of ranting here... I see all too many so-called "museums" in the list of railroad museums acting like just another railfan trying to house a slide collection. There is little thought given to doing it correctly, and no research into how to do it properly. If we are museums, and some of us have made the cut already, we should act like it in a professional manner.

I freely admit the collections here at our facility are not as good as they could be, but we are working on it seriously now, and are making progress.

Another good website to check is that of the American Association of Museums, the group doing accreditations of museums. You may not want to join or ever do a full accreditation process with them, but just a look through their public stuff will give you an insight into where we should strive to be.

_________________
Bob Crowley
Corporate Secretary
New Hope Valley Railway
North Carolina Railroad Museum
East Carolina Chapter, NRHS
Bonsal, NC, USA
www.nhvry.org


 Post subject: Re: Electronic Document Storage and Indexing
PostPosted: Fri Oct 13, 2006 4:57 pm 

Joined: Sun Aug 22, 2004 1:51 pm
Posts: 11501
Location: Somewhere east of Prescott, AZ along the old Santa Fe "Prescott & Eastern"
Stepping into this a bit late, let me approach this from a different perspective:

What is your ultimate goal in digitizing these documents? Is it being able to index/search them? Is it preservation? Is it making a duplicate set to house elsewhere, or making a set so you can give the originals back to their owner? Is it electronic/online transmission/accessibility?

The intended mission should play a large part in the decision process.

PDF's big selling point for years has been the fact that the average Joe cannot download the document and then alter it to suit other nefarious needs (such as dropping the word "not" from one of the Ten Commandments or inserting another clause into an application), as one could with a text document. If security is not an issue (i.e. is anyone going to alter the dimensions of the journal box in a truck drawing?), is Adobe PDF strictly necessary?

TIFF is the gold standard for imaging--and is also a byte hog. If you are putting things online for download, would bitmap or JPEG do just as well at one-tenth or less the file size?
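
To put rough numbers on the size question, the sketch below works out the raw pixel data for one scanned page. The page size (8.5 x 11 inches), resolution (300 dpi) and color depth (24 bits per pixel) are assumptions chosen for illustration. Real files will differ, since TIFF can also be stored compressed and JPEG sizes depend on the quality setting.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw pixel data size of a scanned page, ignoring headers and compression."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

# Assumed example: letter-size page, 300 dpi, 24-bit color.
size = uncompressed_scan_bytes(8.5, 11, 300, 24)
print(f"{size:,} bytes, about {size / 1_000_000:.1f} MB per page")
print(f"one tenth of that: about {size / 10 / 1_000_000:.1f} MB")
```

Under these assumptions an uncompressed master comes to roughly 25 MB per page, so a one-tenth-size download copy would be about 2.5 MB.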


 Post subject: Re: Electronic Document Storage and Indexing
PostPosted: Fri Oct 13, 2006 6:19 pm 

Joined: Thu Jun 16, 2005 11:54 am
Posts: 609
Sandy, I would have to say that the answer to your questions about the goal in digitizing the documents is "all of the above" plus conservation of space. In their present paper form they take up most of two rooms about 12x12 feet each, in fact they occupy so much space in the rooms that it can be difficult to get to items and there is really not enough space to work with some of them (gosh, I heard exactly the same thing about another archive very recently). There are thousands of documents generated over a span of 70 years, the older ones obviously needing electronic preservation as a backup to possible future deterioration. I have started working on the project, but the time I can devote to it is limited, so my mission is to electronically archive the material as quickly as practical giving priority to the items that are likely to deteriorate first and/or have the greatest information value. The assignment of priority is of course a judgement, but since I participated in the original production of the materials I am pretty well qualified to make those choices.

_________________
"We Repair No Locomotive Before Its Time"

