A few months ago I worked on a process that imports Facebook Leads into a legacy system. Facebook sends its advertising data as UTF-16 encoded CSV. The tool also had to support CSV files that had occasionally been edited by hand, which changed the encoding to something a bit more standard. Thankfully, there was a small library out there that helped. So, in case you ever need to guess whether a file is UTF-16 and don’t want to roll your own, here you go:
<dependency>
    <groupId>org.codehaus.guessencoding</groupId>
    <artifactId>guessencoding</artifactId>
    <version>1.4</version>
    <type>jar</type>
</dependency>
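import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import com.glaforge.i18n.io.CharsetToolkit; // package name as shipped in guessencoding 1.4

File in = new File(inputFile);
if (!in.exists()) {
    throw new IllegalArgumentException("Input file not found");
}
// Inspect the first 4096 bytes; fall back to UTF-8 if nothing more specific is detected.
Charset cs = CharsetToolkit.guessEncoding(in, 4096, StandardCharsets.UTF_8);
System.out.println("Reading " + inputFile + " as " + cs.name());
// Pass the Charset itself rather than its name, so no name lookup is needed.
Reader r = new InputStreamReader(new FileInputStream(in), cs);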
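For the curious, here is roughly what the "roll your own" route might start with: checking the first few bytes for a byte order mark (BOM). This is only a minimal sketch using the JDK, not the library's actual implementation. The class name BomSniffer and the method sniff are made up for illustration, and a real detector would also have to handle files without a BOM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of guessencoding.
final class BomSniffer {

    /**
     * Returns the charset indicated by a leading byte order mark,
     * or the given fallback when no known BOM is present.
     * Limitation: FF FE is also the start of a UTF-32LE BOM (FF FE 00 00),
     * which this sketch does not distinguish.
     */
    static Charset sniff(byte[] head, Charset fallback) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;      // EF BB BF
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return StandardCharsets.UTF_16BE; // FE FF
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return StandardCharsets.UTF_16LE; // FF FE
            }
        }
        return fallback;                          // no BOM found
    }
}
```

One thing to keep in mind if you go this way: Java's UTF-16LE and UTF-16BE decoders do not strip the BOM, so the decoded text will start with U+FEFF unless you skip it yourself. A hand-edited file that lost its BOM simply falls through to the fallback, which is where a library that looks at more than the first bytes earns its keep.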