diff options
Diffstat (limited to 'TODO')
-rw-r--r-- | TODO | 178 |
1 files changed, 178 insertions, 0 deletions
@@ -0,0 +1,178 @@ +General +======= + + * See if I can support the encryption format used with /R 5 /V 5, + even though a qpdf-announce subscriber with an adobe.com email + address mentioned that this is deprecated. There is also a new + encryption format coming in a future release, which may be better + to support. As of the qpdf 3.0 release, the specification was not + publicly available yet. + + * Consider the possibility of doing something locale-aware to support + non-ASCII passwords. Update documentation if this is done. + + * Look for %PDF header somewhere within the first 1024 bytes of the + file. Also accept headers of the form "%!PS−Adobe−N.n PDF−M.m". + See Implementation notes 13 and 14 in appendix H of the PDF 1.7 + specification. This is bug 3267974. + + * Consider impact of article threads on page splitting/merging. + Subramanyam provided a test file; see ../misc/article-threads.pdf. + Email Q-Count: 431864 from 2009-11-03. Other things to consider: + outlines, page labels, thumbnails, zones. There are probably + others. + + * See whether it's possible to remove the call to + flattenScalarReferences. I can't easily figure out why I do it, + but removing it causes strange test failures in linearization. I + would have to study the optimization and linearization code to + figure out why I added this to begin with and what in the code + assumes it's the case. For enqueueObject and unparseChild in + QPDFWriter, simply removing the checks for indirect scalars seems + sufficient. Looking back at the branch in the apex epub + repository, before flattening scalar references, there was special + case code in QPDFWriter to avoid writing out indirect nulls. It's + still not obvious to me why I did it though. + + To pursue this, remove the call to flattenScalarReferences in + QPDFWriter.cc and disable the logic_error exceptions for indirect + scalars. Just search for flattenScalarReferences in QPDFWriter.cc + since the logic errors have comments that mention + flattenScalarReferences. Then run the test suite. Several files + that explicitly test flattening of scalar references fail, but the + indirect scalars are properly preserved and written. But then + there are some linearized files that have a bunch of unreferenced + objects that contain scalars. Need to figure out what these are + and why they're there. Maybe they're objects that used to be + stream lengths. Probably we just need to make sure don't traverse + through a stream's /Length stream when enqueueing stream + dictionaries. This could potentially happen with any object that + QPDFWriter replaces when writing out files. Such objects would be + orphaned in the newly written file. This could be fixed, but it + may not be worth fixing. + + If flattenScalarReferences is removed, a new method will be needed + for checking PDF files. + + * See if we can avoid preserving unreferenced objects in object + streams even when preserving the object streams. + + * For debugging linearization bugs, consider adding an option to save + pass 1 of linearization. This code is sufficient. Change the + interface to allow specification of a pass1 file, which would + change the behavior as in this patch. + +------------------------------ +Index: QPDFWriter.cc +=================================================================== +--- QPDFWriter.cc (revision 932) ++++ QPDFWriter.cc (working copy) +@@ -1965,11 +1965,15 @@ + + // Write file in two passes. Part numbers refer to PDF spec 1.4. + ++ FILE* XXX = 0; + for (int pass = 1; pass <= 2; ++pass) + { + if (pass == 1) + { +- pushDiscardFilter(); ++// pushDiscardFilter(); ++ XXX = fopen("/tmp/pass1.pdf", "w"); ++ pushPipeline(new Pl_StdioFile("pass1", XXX)); ++ activatePipelineStack(); + } + + // Part 1: header +@@ -2204,6 +2208,8 @@ + + // Restore hint offset + this->xref[hint_id] = QPDFXRefEntry(1, hint_offset, 0); ++ fclose(XXX); ++ XXX = 0; + } + } + } +------------------------------ + + * Handle embedded files. PDF Reference 1.7 section 3.10, "File + Specifications", discusses this. Once we can definitely recognize + all embedded files in a document, we can update the encryption + code to handle it properly. In QPDF_encryption.cc, search for + cf_file. Remove exception thrown if cf_file is different from + cf_stream, and write code in the stream decryption section to use + cf_file instead of cf_stream. In general, add interfaces to get + the list of embedded files and to extract them. To handle general + embedded files associated with the whole document, follow root -> + /Names -> /EmbeddedFiles -> /Names to get to the file specification + dictionaries. Then, in each file specification dictionary, follow + /EF -> /F to the actual stream. There may be other places file + specification dictionaries may appear, and there are also /RF keys + with related files, so reread section 3.10 carefully. + + * The description of Crypt filters is unclear with respect to how to + use them to override /StmF for specific streams. I'm not sure + whether qpdf will do the right thing for any specific individual + streams that might have crypt filters. The specification seems to + imply that only embedded file streams and metadata streams can have + crypt filters, and there are already special cases in the code to + handle those. Most likely, it won't be a problem, but someday + someone may find a file that qpdf doesn't work on because of crypt + filters. There is an example in the spec of using a crypt filter + on a metadata stream. + + For now, we notice /Crypt filters and decode parameters consistent + with the example in the PDF specification, and the right thing + happens for metadata filters that happen to be uncompressed or + otherwise compressed in a way we can filter. This should handle + all normal cases, but it's more or less just a guess since I don't + have any test files that actually use stream-specific crypt filters + in them. + + * The second xref stream for linearized files has to be padded only + because we need file_size as computed in pass 1 to be accurate. If + we were not allowing writing to a pipe, we could seek back to the + beginning and fill in the value of /L in the linearization + dictionary as an optimization to alleviate the need for this + padding. Doing so would require us to pad the /L value + individually and also to save the file descriptor and determine + whether it's seekable. This is probably not worth bothering with. + + * The whole xref handling code in the QPDF object allows the same + object with more than one generation to coexist, but a lot of logic + assumes this isn't the case. Anything that creates mappings only + with the object number and not the generation is this way, + including most of the interaction between QPDFWriter and QPDF. If + we wanted to allow the same object with more than one generation to + coexist, which I'm not sure is allowed, we could fix this by + changing xref_table. Alternatively, we could detect and disallow + that case. In fact, it appears that Adobe reader and other PDF + viewing software silently ignores objects of this type, so this is + probably not a big deal. + + * Pl_PNGFilter is only partially implemented. If we ever decoded + images, we'd have to finish implementing it along with the other + filter decode parameters and types. For just handling xref + streams, there's really no need as it wouldn't make sense to use + any kind of predictor other than 12 (PNG UP filter). + + * If we ever want to have check mode check the integrity of the free + list, this can be done by looking at the code from prior to the + object stream support of 4/5/2008. It's in an if (0) block and + there's a comment about it. There's also something about it in + qpdf.test -- search for "free table". On the other hand, the value + of doing this seems very low since no viewer seems to care, so it's + probably not worth it. + + * QPDFObjectHandle::getPageImages() doesn't notice images in + inherited resource dictionaries. See comments in that function. + + * Based on an idea suggested by user "Atom Smasher", consider + providing some mechanism to recover earlier versions of a file + embedded prior to appended sections. + + * From a suggestion in bug 3152169, consider having an option to + re-encode inline images with an ASCII encoding. + + * From github issue 2, provide more in-depth output for examining + hint stream contents. |