$\begingroup$

EDIT: This was originally posted as an XML issue, but the file format turned out to be JSON. The problem is essentially the same: Import worked in 11.2 but not in 11.3. As per Hans' comments, the successful import was replicated in 11.2.

This JSON file was extracted from a PDF. In 11.2 the import worked correctly (I was able to save the result as an MX file). The file is available via the Google Drive link below, since it's over 10 MB.

https://drive.google.com/file/d/13wFNeFwD1tLG4hFiz9vKpZxofpZBg95W/view?usp=sharing

I tried Import with no options and with the formats "JSON", "RawJSON", and "ExpressionJSON", as well as ImportString. All failed in 11.3.

The original PDF is found here.

$\endgroup$
  • $\begingroup$ @alanccalvitti Are you sure what you posted to Google Drive is an XML file? It looks like JSON or a JSON-like format. $\endgroup$ Commented May 17, 2018 at 23:44
  • $\begingroup$ In v11.2 Import["DMAT201708.json"] took ~14 sec to display the entire contents. Redundant formatting text abounds. $\endgroup$ Commented May 17, 2018 at 23:57
  • $\begingroup$ Depending on your version of Acrobat Reader, you should be able to save the PDF file as text. That particular file throws an "Import::general: Expected cross reference table" error. $\endgroup$ Commented May 18, 2018 at 0:43
  • $\begingroup$ @Hans, that does look like JSON, but Import with "JSON", "RawJSON", and "ExpressionJSON", as well as ImportString, all failed in 11.3. But you were able to import it as "JSON" in 11.2? If so, maybe I can just change the title from XML to JSON. (The redundant formatting is indeed there around every word; that's the way it was written.) $\endgroup$ Commented May 18, 2018 at 15:54
  • $\begingroup$ Currently using 11.2. I did change the extension. Because of the dimensions of this list it does not display nicely. Mathematica struggles with formatting and display of the contents. The original PDF file is 1.47 MB, I was able to export to text at 773 KB using Acrobat Reader DC all on Windows. Export to RTF ballooned to 9 MB using Acrobat Pro on Mac. $\endgroup$ Commented May 18, 2018 at 18:30

1 Answer

$\begingroup$

There is a way to fix the invalid UTF-8: re-encode the string as UTF-8, where str is the file contents read as a string:

FromCharacterCode[ToCharacterCode[str, "UTF-8"]]

Then use ImportString on the result.

In your case:

In[13]:= importedjson = ImportString[
    FromCharacterCode[
        ToCharacterCode[
            Import["/Users/aarone/Desktop/DMAT201708X.xml", "String"],
            "UTF-8"]],
    "RawJSON"];

In[14]:= importedjson[[;; 1, 1 ;; 3]]

Out[14]= {<|"number" -> 1, "pages" -> 394, "height" -> 1188|>}
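
To see what the round trip does at the byte level, here is a minimal sketch on a made-up nine-character sample (not taken from the actual file). It assumes the offending characters are single bytes above 127, such as a Latin-1 "é" (code 233), which are not valid UTF-8 on their own. Whether ImportString rejects the unrepaired string may depend on the version, so no output is shown for the final step.

```mathematica
(* Hypothetical sample: the JSON text {"a":"é"} with "é" stored as the single code 233 *)
raw = FromCharacterCode[{123, 34, 97, 34, 58, 34, 233, 34, 125}];

(* Encode every character as UTF-8; codes above 127 expand to multi-byte sequences *)
bytes = ToCharacterCode[raw, "UTF-8"];

(* Rebuild a string with one character per byte, so its codes form valid UTF-8 *)
fixed = FromCharacterCode[bytes];

(* Compare the character codes before and after: 233 should become the pair 195, 169 *)
{ToCharacterCode[raw], ToCharacterCode[fixed]}

(* Parse the repaired string as JSON *)
ImportString[fixed, "RawJSON"]
```

The same three steps (read as "String", re-encode, then ImportString) are what the one-liner above chains together.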
$\endgroup$
  • $\begingroup$ Are you using 11.3? I copy-pasted your code and, using my local copy of that XML file, I get this error: Import::jsonhintposandchar: An error occurred near character '?', at line 1:3 $\endgroup$ Commented May 22, 2018 at 16:17
  • $\begingroup$ Yes, I just checked and I did this in 11.3. $\endgroup$ Commented May 22, 2018 at 16:22
  • $\begingroup$ OS? I'm on High Sierra. $\endgroup$ Commented May 22, 2018 at 16:23
  • $\begingroup$ Same OS here. I get the error you get when I try to import the file directly from the URL. $\endgroup$ Commented May 22, 2018 at 16:24
  • $\begingroup$ Interesting, but I'm not using the URL; I have the local copy DMAT201708X.xml. $\endgroup$ Commented May 22, 2018 at 16:37
