np.fromfile() is very slow #58

Open
OleksiiMatiash opened this issue Oct 22, 2023 · 9 comments

Comments

@OleksiiMatiash

I'm porting an app from Python to C# and am trying to choose a .NET equivalent of numpy. The options are NumpyDotNet and NumSharp.
NumpyDotNet is the obvious winner because NumSharp has "not implemented" here and there, and lacks documentation and samples. But my app needs to read and write lots of data as fast as possible, and here is the problem: NumpyDotNet's np.fromfile() is very slow compared to NumSharp's. Here is a benchmark, with throughput in MB/s:

NumpyDotNet:

np.fromfile(fullFilePath, np.UInt8); ~150

np.fromfile(fullFilePath, np.UInt16); ~140

np.fromfile(fullFilePath, np.UInt32); ~260

byte[] bytes = File.ReadAllBytes(fullFilePath);
return np.frombuffer(bytes, np.UInt8, metadata.dataSize, metadata.dataOffset); ~580

byte[] bytes = File.ReadAllBytes(fullFilePath);
return np.frombuffer(bytes, np.UInt16, metadata.dataSize, metadata.dataOffset); ~690

byte[] bytes = File.ReadAllBytes(fullFilePath);
return np.frombuffer(bytes, np.UInt32, metadata.dataSize, metadata.dataOffset); ~690

(Ignore the offset and size; almost the whole array is read.)

NumSharp:

np.fromfile(fullFilePath, NPTypeCode.Int32); ~2700

Reading as NPTypeCode.Int16 is not implemented in NumSharp, so I'm unable to measure it.

python numpy:

np.fromfile(file, np.uint16) ~2700

My app mostly works with UInt16, with some Float32 in the middle of the calculation chain, so I need efficient reading and writing of UInt16.
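The thread does not show how these figures were measured. As a minimal sketch of one way to get a comparable MB/s number, assuming a single timed call and 1 MB = 1,000,000 bytes (the class and method names here are hypothetical, not part of either library):

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class ReadBenchmark
{
    // Times one call to `read` and returns throughput in MB/s,
    // based on the size of the file on disk (1 MB = 1,000,000 bytes).
    public static double MegabytesPerSecond(string path, Action<string> read)
    {
        long bytes = new FileInfo(path).Length;

        var sw = Stopwatch.StartNew();
        read(path);
        sw.Stop();

        return bytes / 1e6 / sw.Elapsed.TotalSeconds;
    }
}
```

A call such as `ReadBenchmark.MegabytesPerSecond(fullFilePath, p => File.ReadAllBytes(p))` gives a raw-I/O baseline to compare library readers against. Repeated runs on the same file are usually served from the OS file cache, so the first run and later runs can differ a lot.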

@KevinBaselinesw
Collaborator

I am on vacation so don't have a lot of time to work on this.

I recommend that you use the ToSerializable/FromSerializable methods to save/restore an ndarray. Then you can use .NET standard XML/JSON serialization operations to save/restore it.

If you really need to use fromfile for some reason and the performance does not meet your needs, I suggest trying to write your own code to open a file and parse/save it.

@KevinBaselinesw
Collaborator

Use either ndarray.ToSerializable() or np.ToSerializable(ndarray a).

@KevinBaselinesw
Collaborator

Please look at issue 48 in this repository for example code.

@OleksiiMatiash
Author

I'm sorry for not mentioning that I need to read and write binary files, not XML/JSON. To be precise: I need to read a file, create an ndarray from it starting at a small offset and running to the end of the file, do calculations, and write the new data back to the same file at the same offset. Typical file size is 100 MB; the offset is 100 KB.
In the Python app I'm doing this:
def readImageData(fileName: str, offset: int, length: int) -> ndarray:
    return np.fromfile(fileName, dtype = np.uint16, count = length, offset = offset)

So it seems to me that ToSerializable/FromSerializable is not the right choice.
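The read-at-offset, compute, write-back-at-offset workflow described above can be sketched with plain .NET I/O, independent of either library. This is only a sketch under assumptions: it needs .NET Core 2.1 or later (for the Span-based Stream overloads) and C# 8, it uses the machine's native byte order, and the `RawImageIO` class and its method names are hypothetical. As with numpy's `offset` parameter, `byteOffset` is in bytes and `count` is in elements.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

public static class RawImageIO
{
    // Reads `count` UInt16 values starting `byteOffset` bytes into the file.
    public static ushort[] ReadUInt16(string path, long byteOffset, int count)
    {
        var data = new ushort[count];
        // View the ushort[] as raw bytes so the stream fills it directly.
        Span<byte> bytes = MemoryMarshal.AsBytes(data.AsSpan());

        using var fs = new FileStream(path, FileMode.Open, FileAccess.Read);
        fs.Seek(byteOffset, SeekOrigin.Begin);

        int total = 0;
        while (total < bytes.Length)
        {
            // Stream.Read may return fewer bytes than requested, so loop.
            int n = fs.Read(bytes.Slice(total));
            if (n == 0) throw new EndOfStreamException();
            total += n;
        }
        return data;
    }

    // Overwrites the region at `byteOffset` in an existing file,
    // leaving everything before the offset (the header) untouched.
    public static void WriteUInt16(string path, long byteOffset, ReadOnlySpan<ushort> values)
    {
        using var fs = new FileStream(path, FileMode.Open, FileAccess.Write);
        fs.Seek(byteOffset, SeekOrigin.Begin);
        fs.Write(MemoryMarshal.AsBytes(values));
    }
}
```

`FileMode.Open` is used for writing so that the existing file is not truncated. How best to hand the resulting `ushort[]` to an ndarray depends on the library and is not covered here.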

@KevinBaselinesw
Collaborator

Below is basically what I am doing internally. I don't have time to write tofile completely today.
If you can make this go faster, then you have a solution.

        [TestMethod]
        public void test_OleksiiMatiash_1()
        {
            string fileName = "xyz.bin";

            ndarray x = np.arange(0, 25, dtype:np.Int16);


            tofile(x, fileName);

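            // length and offset are element counts, not bytes:
            // fromfile reads elements [offset, length). The file holds only
            // 25 values, so length must not exceed 25 or ReadInt16 throws.
            int length = 25;
            int offset = 10;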

            fromfile(fileName, length, offset);
        }

        private void tofile(ndarray x, string fileName)
        {
            System.IO.FileInfo fp = new System.IO.FileInfo(fileName);


            //using (var fs = fp.Create())
            //{
            //    //return NpyArray_ToBinaryStream(self, fs);

            //    //using (var binaryWriter = new System.IO.BinaryWriter(fs))
            //}


        }

        private ndarray fromfile(string fileName, int length, int offset)
        {
            System.IO.FileInfo fp = new System.IO.FileInfo(fileName);

            Int16[] data = new Int16[length - offset];

            using (var fs = fp.OpenRead())
            {
                fs.Seek(offset * sizeof(Int16), System.IO.SeekOrigin.Begin);

                using (System.IO.BinaryReader sr = new System.IO.BinaryReader(fs))
                {
                    for (int i = 0; i < data.Length; i++)
                    {
                        data[i] = sr.ReadInt16();
                    }
                }
       
            }

            return np.array(data);

        }

@KevinBaselinesw
Collaborator

One big difference is that python/C code can very quickly cast an array of Int16 values to a byte array and do a very fast write of the data. .NET does not like it if you try to cast Int16 to byte so you have to write each value in a loop. That will be slower.

@OleksiiMatiash
Author

> One big difference is that python/C code can very quickly cast an array of Int16 values to a byte array and do a very fast write of the data. .NET does not like it if you try to cast Int16 to byte so you have to write each value in a loop. That will be slower.

Got it, thank you. Now I'm wondering whether I can get enough speed with .NET at all :(
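For what it's worth, the .NET base class library does offer ways to move between an Int16 array and its raw bytes without a per-element loop: `Buffer.BlockCopy` copies the bytes in one block, and `MemoryMarshal.Cast` gives a zero-copy view. The sketch below is an illustration only; the `BulkCast` class and its method names are hypothetical, `MemoryMarshal` requires .NET Core 2.1 or later (or the System.Memory package), and the byte order is the machine's native one.

```csharp
using System;
using System.Runtime.InteropServices;

public static class BulkCast
{
    // Copies the raw bytes of an Int16 array in one block copy.
    public static byte[] ToBytes(short[] values)
    {
        var bytes = new byte[values.Length * sizeof(short)];
        Buffer.BlockCopy(values, 0, bytes, 0, bytes.Length);
        return bytes;
    }

    // Zero-copy view of a byte buffer as Int16 values.
    // If the buffer has an odd length, the trailing byte is ignored.
    public static ReadOnlySpan<short> AsInt16(byte[] bytes)
    {
        return MemoryMarshal.Cast<byte, short>(bytes.AsSpan());
    }
}
```

Whether this actually closes the gap to the Python numbers would have to be measured; it only removes the element-by-element conversion step.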

@KevinBaselinesw
Collaborator

Here is another idea. Convert the array to bytes first and then write it to disk. See the example below.
I will leave it to you to measure the performance.

        [TestMethod]
        public void test_OleksiiMatiash_1()
        {
            string fileName = "xyz.bin";

            ndarray x = np.arange(0, 25, dtype:np.Int16);

                   
            tofile(x, fileName);

            int length = 100;
            int offset = 10;

            ndarray y = fromfile(fileName, length, offset);

            return;
        }

        private void tofile(ndarray x, string fileName)
        {
            System.IO.FileInfo fp = new System.IO.FileInfo(fileName);

            byte[] b = x.tobytes();


            using (var fs = fp.Create())
            {
        
                using (var binaryWriter = new System.IO.BinaryWriter(fs))
                {
                    binaryWriter.Write(b);
                }
            }


        }

        private ndarray fromfile(string fileName, int length, int offset)
        {
            System.IO.FileInfo fp = new System.IO.FileInfo(fileName);

            byte[] data = null;

            using (var fs = fp.OpenRead())
            {
                fs.Seek(offset * sizeof(Int16), System.IO.SeekOrigin.Begin);

                using (System.IO.BinaryReader sr = new System.IO.BinaryReader(fs))
                {
                    data = sr.ReadBytes((length - offset) * sizeof(Int16));
                }
       
            }

            return np.frombuffer(data, dtype: np.Int16);

          

        }

@OleksiiMatiash
Author

> Here is another idea. Convert the array to bytes first and then write it to disk. See the example below. I will leave it to you to measure the performance.

        private ndarray fromfile(string fileName, int length, int offset)
        {
            System.IO.FileInfo fp = new System.IO.FileInfo(fileName);

            byte[] data = null;

            using (var fs = fp.OpenRead())
            {
                fs.Seek(offset * sizeof(Int16), System.IO.SeekOrigin.Begin);

                using (System.IO.BinaryReader sr = new System.IO.BinaryReader(fs))
                {
                    data = sr.ReadBytes((length - offset) * sizeof(Int16));
                }
       
            }

            return np.frombuffer(data, dtype: np.Int16);
        }

This is the fastest of all the NumpyDotNet methods I tried; it achieved 810 MB/s.
So reading is fast enough. I'm unable to test writing speed right now, but I hope it will be on par with reading.

Thank you!
