This repository has been archived by the owner on Dec 22, 2020. It is now read-only.

String not valid UTF-8 (BSON::InvalidStringEncoding) #92

Open
dcu opened this issue Apr 15, 2015 · 4 comments

Comments

dcu commented Apr 15, 2015

I get the following exception when importing a collection. The data should be valid, since it is already present in the database.

Any ideas?

    /var/lib/gems/1.9.1/gems/bson-1.10.2/lib/bson/bson_c.rb:20:in `serialize': String not valid UTF-8 (BSON::InvalidStringEncoding)
    from /var/lib/gems/1.9.1/gems/bson-1.10.2/lib/bson/bson_c.rb:20:in `serialize'
    from /var/lib/gems/1.9.1/gems/bson-1.10.2/lib/bson.rb:19:in `serialize'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/schema.rb:212:in `transform'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:148:in `block (3 levels) in import_collection'
    from /var/lib/gems/1.9.1/gems/mongo-1.10.2/lib/mongo/cursor.rb:335:in `each'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:147:in `block (2 levels) in import_collection'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:71:in `block in with_retries'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:69:in `times'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:69:in `with_retries'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:146:in `block in import_collection'
    from /var/lib/gems/1.9.1/gems/mongo-1.10.2/lib/mongo/collection.rb:291:in `find'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:145:in `import_collection'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:123:in `block (2 levels) in initial_import'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:121:in `each'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:121:in `block in initial_import'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:109:in `each'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:109:in `initial_import'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/streamer.rb:28:in `import'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/cli.rb:167:in `run'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/lib/mosql/cli.rb:16:in `run'
    from /var/lib/gems/1.9.1/gems/mosql-0.4.2/bin/mosql:5:in `<top (required)>'
    from /usr/local/bin/mosql:23:in `load'
    from /usr/local/bin/mosql:23:in `<main>'

Please note this is failing even with the --unsafe flag.
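One way to narrow this down is to look for strings whose bytes are not valid UTF-8 in the documents being imported. The following is a minimal sketch, not part of MoSQL: `invalid_string_paths` and `child_path` are hypothetical helpers, the sample document is made up, and it assumes the driver hands strings back tagged as UTF-8, so `String#valid_encoding?` is only an approximation of the check the BSON serializer performs.

```ruby
# Build a dotted path such as "meta.tags.1" (hypothetical helper).
def child_path(path, key)
  path.empty? ? key.to_s : "#{path}.#{key}"
end

# Recursively collect the dotted paths of every String in a document
# whose bytes are not valid in its declared encoding.
def invalid_string_paths(value, path = "")
  case value
  when String
    value.valid_encoding? ? [] : [path]
  when Hash
    value.flat_map { |key, child| invalid_string_paths(child, child_path(path, key)) }
  when Array
    value.each_with_index.flat_map do |child, index|
      invalid_string_paths(child, child_path(path, index))
    end
  else
    []
  end
end

# Made-up document: "\xFF" is never valid in UTF-8, and a trailing "\xC3"
# is a truncated two-byte sequence.
doc = {
  "_id"  => 1,
  "name" => "ok",
  "meta" => { "note" => "caf\xFF", "tags" => ["fine", "bad\xC3"] }
}

puts invalid_string_paths(doc).inspect
```

Running something like this over each document from the failing collection should at least show which field trips the serializer.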

dcu commented Apr 22, 2015

any update on this one?

@Winslett

I had the same issue. I monkey-patched it to remove the invalid key/value pairs from the object. I replaced the mosql binary with the following script, which I call monkey-patched-mosql, and I run the ETL process from it. The script overrides the MoSQL::Schema#transform method. It could be cleaned up by using super.

The ETL errors from my data were caused by binary values and larger-than-expected BSON documents.

#!/usr/bin/env ruby

require 'mosql/cli'

module MoSQL
  class Schema
    def transform(ns, obj, schema=nil, depth = 0)
      schema ||= find_ns!(ns)

      original = obj

      # Do a deep clone, because we're potentially going to be
      # mutating embedded objects.
      obj = BSON.deserialize(BSON.serialize(obj))

      row = []
      schema[:columns].each do |col|

        source = col[:source]
        type = col[:type]

        if source.start_with?("$")
          v = fetch_special_source(obj, source, original)
        else
          v = fetch_and_delete_dotted(obj, source)
          case v
          when Hash
            # Use distinct block parameter names so the outer v is not shadowed.
            v = JSON.dump(Hash[v.map { |key, val| [key, transform_primitive(val)] }])
          when Array
            v = v.map { |it| transform_primitive(it) }
            if col[:array_type]
              v = Sequel.pg_array(v, col[:array_type])
            else
              v = JSON.dump(v)
            end
          else
            v = transform_primitive(v, type)
          end
        end
        row << v
      end

      if schema[:meta][:extra_props]
        extra = sanitize(obj)
        row << JSON.dump(extra)
      end

      log.debug { "Transformed: #{row.inspect}" }

      row
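    rescue BSON::InvalidStringEncoding, BSON::InvalidDocument
      # Keep only the top-level pairs that survive a BSON round trip on their own.
      obj = obj.select do |k, v|
        begin
          BSON.deserialize(BSON.serialize({ k.to_s => v }))
          true
        rescue BSON::InvalidStringEncoding, BSON::InvalidDocument
          puts "Pruning #{k} from the hash."
          false
        end
      end

      # Give up after a few attempts instead of recursing forever.
      raise "tried and failed to prune with #{[ns, obj, schema]}" if depth > 2
      # Retry the transform with the pruned document.
      transform(ns, obj, schema, depth + 1)
    end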
  end
end


MoSQL::CLI.run(ARGV)
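The comment above mentions that the patch could be cleaned up with super. A rough sketch of that idea follows, assuming a Ruby newer than the 1.9 in the trace (Module#prepend arrived in Ruby 2.0). To keep it runnable without the gems, `Schema` and `InvalidStringEncoding` here are stand-ins for MoSQL::Schema and BSON::InvalidStringEncoding, `PruneInvalidStrings` is a hypothetical module name, and the pruning test uses `String#valid_encoding?` instead of the BSON round trip from the original patch.

```ruby
# Stand-in for BSON::InvalidStringEncoding so the sketch runs without the gems.
class InvalidStringEncoding < StandardError; end

# Stand-in for MoSQL::Schema. The real transform builds a row from a schema;
# this one only mimics the failure on invalid strings.
class Schema
  def transform(ns, obj, schema = nil)
    obj.each_value do |v|
      raise InvalidStringEncoding if v.is_a?(String) && !v.valid_encoding?
    end
    obj.values
  end
end

# Hypothetical wrapper: call the original transform via super, and on failure
# prune the offending top-level pairs and retry once.
module PruneInvalidStrings
  def transform(ns, obj, schema = nil)
    super
  rescue InvalidStringEncoding
    pruned = obj.reject { |_k, v| v.is_a?(String) && !v.valid_encoding? }
    raise if pruned.size == obj.size # nothing was pruned, so re-raise
    super(ns, pruned, schema)
  end
end

class Schema
  prepend PruneInvalidStrings
end

row = Schema.new.transform("db.coll", { "a" => "ok", "b" => "bad\xFF" })
puts row.inspect
```

Because the wrapper delegates to the original method, the body of transform does not have to be copied, and the depth counter from the patch is no longer needed since there is only one retry.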

jtmarmon commented May 6, 2015

+1 - anyone know what would cause this? I checked the timestamp that it appears to be failing on and I don't see any issues

jtmarmon commented May 6, 2015

Looks like there was a PR open to resolve this, #83, but it broke tests.
